Abstract

A dictionary (or map) is a key-value store that requires all keys be unique, and a multimap is a 
key-value store that allows for multiple values to be associated with the same key. We design hashing- 
based indexing schemes for dictionaries and multimaps that achieve worst-case optimal performance 
for lookups and updates, with a small or negligible probability the data structure will require a rehash 
operation, depending on whether we are working in the external-memory (I/O) model or one of
the well-known versions of the Random Access Machine (RAM) model. One of the main features 
of our constructions is that they are fully de-amortized, meaning that their performance bounds hold 
without one having to tune their constructions with certain performance parameters, such as the constant 
factors in the exponents of failure probabilities or, in the case of the external-memory model, the size 
of blocks or cache lines and the size of internal memory (i.e., our external-memory algorithms are 
cache oblivious). Our solutions are based on a fully de-amortized implementation of cuckoo hashing, 
which may be of independent interest. This hashing scheme uses two cuckoo hash tables, one "nested" 
inside the other, with one serving as a primary structure and the other serving as an auxiliary supporting 
queue/stash structure that is super-sized with respect to traditional auxiliary structures but nevertheless 
adds negligible storage to our scheme. This auxiliary structure allows the success probability for cuckoo 
hashing to be very high, which is useful in cryptographic or data-intensive applications. 



1 Introduction 

A dictionary (or map) is a key-value store that requires all keys be unique, and a multimap [3] is a key-value
store that allows for multiple values to be associated with the same key. Such structures are ubiquitous
in the "inner-loop" computations involved in various algorithmic applications. Thus, we are interested in
implementations of these abstract data types that are based on hashing and use O(n) words of storage, where
n is the number of items in the dictionary or multimap.

In addition, because such solutions are used in real-time applications, we are interested in implementations
that are de-amortized, meaning that they have asymptotically optimal worst-case lookup and update
complexities, but may have small probabilities of overflowing their memory spaces. Moreover, we would
like these lookup and update bounds to hold without requiring that we build such a data structure specifically
"tuned" to certain performance parameters, since it is not always possible to anticipate such parameters at
the time such a data structure is deployed (especially if the length p(n) of the sequence of operations on
the structure is not known in advance). For instance, if we wish for a failure probability that is bounded by
1/n^c or 1/n^{c log n}, for some constant c > 0, we should not be required to build the data structure using an
amount of space or other components that are parameterized by c. (For example, our first construction gives
a single algorithm parameterized only by n, and for which, for any s > 0, lookups take time at most s with
probability that depends on s. Previous constructions are parameterized by c as well, and lookups take time
c with probability 1 - 1/n^{c'} for fixed constants c, c'.) Likewise, in the external-memory model, we would
like solutions that are cache-oblivious [21], meaning that they achieve their performance bounds without
being tuned for the parameters of the memory hierarchy, like the size, B, of disk blocks, or the size, M, of
internal memory. We refer to solutions that avoid such parameterized constructions as fully de-amortized.

By extending and combining various ideas in the algorithms literature, we show how to develop fully
de-amortized data structures, based on hashing, for dictionaries and multimaps. Without specifically tuning our
structures to constant factors in the exponents, we provide solutions with performance bounds that hold with
high probability, meaning that they hold with probability 1 - 1/poly(n), or with overwhelming probability,
meaning that they hold with probability 1 - 1/n^{ω(1)}. We also use the term with small probability to mean
1/poly(n), and with negligible probability to mean 1/n^{ω(1)}. Briefly, we are able to achieve the following
bounds:

• For dictionaries, we present two fully de-amortized constructions. The first is for the Practical RAM
model [33], and it performs all standard operations (lookup, insert, delete) in O(1) steps in the worst
case, where all guarantees hold for any polynomial number of operations with high probability.
The second works in the external-memory (I/O) model, the standard RAM model, and the AC^0
RAM model [40], and also achieves O(1) worst-case operations, where these guarantees hold with
overwhelming probability.

• For multimaps, we provide a fully de-amortized scheme that in addition to the standard operations 
can quickly return or delete all values associated with a key. Our construction is suitable for external 
memory and is cache oblivious in this setting. For instance, our external-memory solution returns 
all n_k values associated with a key in O(1 + n_k/B) I/Os, where B is the block size, but it is not
specifically tuned with respect to the parameter B. 

Our algorithms use linear space and work in the online setting, where each operation must be completed 
before performing the next. 

Our solutions for dictionaries and multimaps include the design of a variation on cuckoo hash tables,
which were presented by Pagh and Rodler [36] and studied by a variety of other researchers (e.g., see [16,
28, 34]). These structures use a freedom to place each key-value pair in one of two hash tables to achieve
worst-case constant-time lookups and removals and amortized constant-time insertions, where operations
fail with polynomially small probability. We obtain the first fully de-amortized variation on cuckoo hash
tables, with negligible failure probability to boot.

1.1 Motivations and Models 

Key-value associations are used in many applications, and hash-based dictionary schemes are well-studied
in the literature (e.g., see [14]). Multimaps [3] are less studied, although a multimap can be viewed as
a dynamic inverted file or inverted index (e.g., see Knuth [26]). Given a collection, Γ, of documents, an
inverted file is an indexing strategy that allows one to list, for any word w, all the documents in Γ where
w appears. Multimaps also provide a natural representation framework for adjacency lists of graphs, with
nodes being keys and adjacent edges being values associated with a key. For other applications, please see
Angelino et al. [3].

Hashing-based implementations of dictionaries and multimaps must necessarily be designed in a
computational model that supports indexed addressing, such as the Random Access Machine (RAM) model
(e.g., see [14]) or the external-memory (I/O) model (e.g., see [1, 42]). Thus, our focus in this paper is on
solutions in such models. There are, in fact, several versions of the RAM model, and, rather than insist that
our solutions be implemented in a specific version, we consider solutions for several of the most well-known
versions:

• The standard RAM: all arithmetic and comparison operations, including integer addition, subtraction,
multiplication, and division, are assumed to run in O(1) time.^1

• The Practical RAM [33]: integer addition, as well as bit-wise Boolean and SHIFT operations on
words, are assumed to run in O(1) time, but not integer multiplication and division.

• The AC^0 RAM [40]: any AC^0 function can be performed in constant time on memory words, including
addition and subtraction, as well as several bit-level operations included in the instruction sets of
modern CPUs, such as Boolean and SHIFT operations. This model does not allow for constant-time
multiplication, however, since multiplication is not in AC^0 [20].

Thus, the AC^0 RAM is arguably the most realistic, the standard RAM is the most traditional, and the
Practical RAM is a restriction of both. As we are considering dictionary and multimap solutions in all of
these models, we assume that the hash functions being used are implementable in the model in question
and that they run in O(1) time and are sufficiently random to support cuckoo hashing. This assumption
is supported in practice, for instance, by the fact that one of the most widely-used hash functions, SHA-1,
can be implemented in O(1) time in the Practical RAM model. See also Section 3.5 for discussion of the
theoretical foundations of this assumption.

A framework that is growing in interest and impact for designing algorithms and data structures in the 
external-memory model is the cache-oblivious design paradigm, introduced by Frigo et al. [21]. In this
external-memory paradigm, one designs an algorithm or data structure to minimize the number of I/Os 
between internal memory and external memory, but the algorithm must not be explicitly parameterized by 
the block size, B, or the size of internal memory, M. The advantage of this approach is that one such 
algorithm can comfortably scale across all levels of the memory hierarchy and can also be a better match for 
modern compilers that perform predictive memory fetches. 

Our notion of a "fully de-amortized" data structure extends the cache-oblivious design paradigm in two 
ways. First, it requires that all operations be characterized in terms of their worst-case performance, not 
their amortized performance. Second, it extends to internal-memory models the notion of avoiding specific
tuning of the data structure in terms of non-essential parameters. Formally, we say that a data structure is
fully de-amortized if its performance bounds hold in the worst case and the only parameter its construction
details depend on is n, the number of items it stores. Thus, a fully de-amortized data structure implemented in
external-memory is automatically cache-oblivious and its I/O bounds hold in the worst case. 



^1 Most, but not all, researchers also assume that the standard RAM supports bit-wise Boolean operations, such as AND, OR, and
XOR; hence, we also allow for these as constant-time operations in the standard RAM model.



In addition to being fully de-amortized, our strongest constructions have performance bounds that hold 
with overwhelming probability. While our aim of achieving structures that provide worst-case constant 
time operations with overwhelming probability instead of with high probability may seem like a subtle 
improvement, there are many applications where it is essential. In particular, it is common in cryptographic 
applications to aim for negligible failure probabilities. For example, cuckoo hashing structures with
negligible failure probabilities have recently found applications in oblivious RAM simulations [22]. Moreover, a
significant motivation for de-amortized cuckoo hashing is to prevent timing attacks and clocked adversaries
from compromising a system [5]. Finally, guarantees that hold with overwhelming probability allow us to
handle super-polynomially long sequences of updates, as long as the total number of items resident in the 
dictionary is bounded by n at all times. This may be useful in long-running or data-intensive applications. 

1.2 Previous Related Work 

Since the introduction of the cache-oblivious framework by Frigo et al. [21], several cache-oblivious
algorithms have subsequently been presented, including cache-oblivious B-trees fT|, cache-oblivious binary
search trees [9], and cache-oblivious sorting [10]. Pagh et al. [37] describe a scheme for cache-oblivious
hashing, which is based on linear probing and achieves O(1) expected-time performance for lookups and
updates, but it does not achieve constant time bounds for any of these operations in the worst case.

As mentioned above, the multimap ADT is related to the inverted file and inverted index structures,
which are well-known in text indexing applications (e.g., see Knuth [26]) and are also used in search engines
(e.g., see Zobel and Moffat [44]). Cutting and Pedersen [15] describe an inverted file implementation that
uses B-trees for the indexing structure and supports insertions, but doesn't support deletions efficiently.
More recently, Luk and Lam [31] describe an internal-memory inverted file implementation based on hash
tables with chaining, but their method also does not support fast item removals. Lester et al. [29, 30] and
Büttcher et al. [13] describe external-memory inverted file implementations that support item insertions only.
Büttcher and Clarke [12] consider trade-offs for allowing for both item insertions and removals, and Guo et
al. [23] give a solution for performing such operations by using a B-tree variant. Finally, Angelino et al. [3]
describe an efficient external-memory data structure for the multimap ADT, but like the above-mentioned
work on inverted files, their method is not cache-oblivious; hence, it is not fully de-amortized.

Also as mentioned above, our solutions include the design of a variation on cuckoo hash tables, which
were presented by Pagh and Rodler [36] and studied by a variety of other researchers (e.g., see [16, 28, 34]).
These structures use a freedom to place each key-value pair in one of two hash tables to achieve worst-case
constant-time lookups and removals and amortized constant-time insertions with high probability. Kirsch
and Mitzenmacher [24] and Arbitman et al. [5] study a method for de-amortizing cuckoo hashing, which
achieves constant-time lookups, insertions, and deletions with high probability, and uses space (2 + ε)n for
any constant ε > 0 (as is standard in cuckoo hashing). These methods are not fully de-amortized, however,
since, in order to achieve a failure probability of 1/n^c, they construct an auxiliary structure consisting
of O(c) small lookup tables. In contrast, neither of our dictionary constructions is parameterized by c;
furthermore, our second construction provides guarantees that hold with overwhelming probability rather
than with high probability. In a subsequent paper, Arbitman et al. [6] study a hashing method that achieves
worst-case constant-time lookups, insertions, and removals with high probability while maintaining loads
very close to 1, but their method also is not fully de-amortized.

Kirsch, Mitzenmacher, and Wieder [25] introduced the notion of a stash for cuckoo hashing, which
allows the failure probability to be reduced to O(1/n^a), for any a > 0, by using a constant-sized adjunct
memory to store items that wouldn't otherwise be able to be placed. One of our novel additions in this
paper is to demonstrate that by using super-constant sized stashes, along with a variation of the q-heap data
structure, we can ensure failures happen only with negligible probability while maintaining constant time
lookup and delete operations.



1.3 Our Results 

In this paper we describe efficient hashing-based implementations of the dictionary and multimap ADTs. 
Our constructions are fully de-amortized and are alternately designed for the external-memory (I/O) model 
and the well-known versions of the RAM model mentioned above. Because they are fully de-amortized, our 
external-memory algorithms are cache-oblivious. 

We begin by presenting two new fully de-amortized cuckoo hashing schemes in Section 2. Both of
our constructions provide O(1) worst-case lookups, insertions, and deletions, where the guarantees of the
first construction (in the Practical RAM model) hold with high probability, and the guarantees of the second
construction (in the external-memory model, the standard RAM model, and the AC^0 RAM model) hold with
overwhelming probability. Moreover, these results hold even if we use polylog(n)-wise independent hash
functions. Like the construction of Arbitman et al. [5], both of our dictionaries use space (2 + ε)n for any
constant ε > 0, though when combined with another result of Arbitman et al. [6], we can achieve (1 + ε)n
words of space for any constant ε > 0 (see Section 3.5). Our second dictionary can be seen as a quantitative
improvement over the previous solution for de-amortized cuckoo tables [5], as the guarantees of [5] only
hold with high probability (and their solution is not fully de-amortized).

Both of our dictionary constructions utilize a cuckoo hash table that has another cuckoo hash table as 
an auxiliary structure. This secondary cuckoo table functions simultaneously as an operation queue (for the 
sake of de-amortization [5, 6, 24]) and a stash (for the sake of improved probability of success [25]).

Our second construction also makes use of a data structure for the AC^0 RAM model we call the atomic
stash. This structure maintains a small dictionary, of size at most 0(log^' ^ n), so as to support constant-
time worst-case insertions, deletions, and lookups, using 0(log^ " n) space. This data structure is related
to the q-heap or atomic heap data structure of Fredman and Willard pO| (see also [43]), which requires
the use of lookup tables of size 0{rf) and is limited to sets of size 0(log^' ^ n). Andersson et al. [2] and
Thorup [40] give alternative implementations of these data structures in the AC^0 RAM model, but these
methods still need precomputed functions encoded in table lookups or have time bounds that are amortized
instead of worst-case. Our methods instead make no use of precomputed lookup tables, hold in the worst
case, and use techniques that are simpler than those of Andersson et al. and Thorup. We emphasize that
our results do not depend on our specific atomic stash implementation; one could equally well use q-heaps,
atomic heaps, or other data structures that allow constant-time lookups into data sets of size ω(1) (under
suitable assumptions, such as keys fitting into a memory word) in order to obtain bounds that hold with
overwhelming probability in the standard or AC^0 RAM models. We view this combination of cuckoo hash
tables with atomic stashes (or other similar data structures) as an important contribution, as they allow
super-constant sized stashes for cuckoo hashing while still maintaining constant time lookups.

In Section 4, we also show how to build on our fully de-amortized cuckoo hashing scheme to give an
efficient cache-oblivious multimap implementation in the external memory model. Our multimap
implementation utilizes two instances of the nested cuckoo structure of Section 2, together with arrays for storing
collections of key-value pairs that share a common key. This implementation assumes that there is a
cache-oblivious mechanism to allocate and deallocate power-of-two sized memory blocks with constant-factor
space and I/O overhead; this assumption is theoretically justified by the results of Brodal et al. [11]. A lower
bound of Verbin and Zhang [41] implies that our bounds are optimal up to constant factors in the external
memory model, even if we did not support fast findAll and removeAll operations.

In addition to a theoretical analysis of our data structures, we have performed preliminary experiments 
with an implementation, and a later writeup will include full details of these. 

The time bounds we achieve for the dictionary and multimap ADT methods are shown in Table 1. Our
space bounds are all O(n), for storing a dictionary or multimap of size at most n.



Method               Dictionary I/O Performance   Multimap I/O Performance
-------------------  ---------------------------  -------------------------
add(k, v)            O(1)                         O(1)
containsKey(k)       O(1)                         O(1)
containsItem(k, v)   O(1)                         O(1)
remove(k, v)         O(1)                         O(1)
get(k)/getAll(k)     O(1)                         O(1 + n_k/B)
removeAll(k)         -                            O(1)

Table 1: Performance bounds for our dictionary and multimap implementations, which all hold in the worst
case with overwhelming probability, except for implementations in the Practical RAM model, in which case
the above bounds hold with high probability. These bounds are asymptotically optimal. We use B to denote
the block size, and n_k to denote the number of items with key equal to k.
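To make the semantics of the operations in Table 1 concrete, the following is a plain in-memory reference model of the multimap ADT. It is only a sketch for checking behavior: it makes no attempt at the I/O bounds above, the class name ReferenceMultimap and the snake_case method names are our own, and it assumes each (key, value) pair is stored at most once.

```python
from collections import defaultdict


class ReferenceMultimap:
    """Behavioral model of the multimap operations; not the paper's structure."""

    def __init__(self):
        self._values = defaultdict(set)      # key -> set of associated values

    def add(self, k, v):
        self._values[k].add(v)

    def contains_key(self, k):
        return k in self._values

    def contains_item(self, k, v):
        return v in self._values.get(k, ())

    def remove(self, k, v):
        vals = self._values.get(k)
        if vals is not None:
            vals.discard(v)
            if not vals:                     # drop keys with no remaining values
                del self._values[k]

    def get_all(self, k):
        return set(self._values.get(k, ()))

    def remove_all(self, k):
        self._values.pop(k, None)
```

Used as an inverted index, a word plays the role of the key and each document containing it is one associated value.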

2 Nested Cuckoo Hashing 

In this section, we describe both of our nested cuckoo hash table data structures, which provide fully 
de-amortized dictionary data structures with worst-case O(1)-time lookups and removals and O(1)-time
insertions with high and overwhelming probability, respectively. At a high level, this structure is similar to
that of Arbitman et al. [5] and Kirsch and Mitzenmacher [24], in that our scheme and these schemes use
a cuckoo hash table as a primary structure, and an auxiliary queue/stash structure to de-amortize insertions 
and reduce the failure probability. But our approach substantially differs from prior methods in the details 
of the auxiliary structure. In particular, our auxiliary structure is itself a full-fledged cuckoo hash table, with 
its own (much smaller) queue and stash, whereas prior methods use more traditional queues for the auxiliary 
structure. 

Our two dictionary constructions are identical, except for the implementation of the "inner" cuckoo hash 
table's queue and stash. Below, we describe both constructions simultaneously, highlighting their differences 
when necessary. 



2.1 The Components of Our Structure 

Our dictionaries maintain a dynamic set, S, of at most n items. Our primary storage structure is a cuckoo
hash table, T, subdivided into two subtables, T_0 and T_1, together with two random hash functions, h_0 and
h_1. Each table T_i stores at most one item in each of its m cells, and we assume m > (1 + ε)n where n
is the total number of items, for some fixed constant ε > 0. For any item x = (k, v) that is stored in T, x
will either be stored in cell T_0[h_0(k)] or in cell T_1[h_1(k)]. Some of the items in S may not be stored in T,
however. They will instead be stored in an auxiliary structure, Q.

The structure Q is simultaneously two double-ended queues (deques), both of which support fast enqueue
and dequeue operations at either the front or the rear, and a cuckoo hash table with its own queue
and stash. Because Q is a cuckoo hash table, it supports worst-case constant-time lookups for items based 
on their keys. The two deques correspond to the queue and stash of the primary structure; we call them 
OuterQueue and OuterStash respectively. Each item x that is stored in Q is therefore also augmented with 
a prev pointer, which points to the predecessor of x in its deque, and a next pointer, which points to the 
successor of x in its deque. We also augment both deques with front and rear pointers, which respectively 
point to the first element in the deque and the last element in the deque. 
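The pointer augmentation described above can be sketched as an intrusive doubly linked deque. This is a minimal illustration, assuming hypothetical names (Node, LinkedDeque, unlink); the point is that an item located through the hash table can be removed from the middle of its deque in O(1) time by relinking its neighbors.

```python
class Node:
    """An item stored in Q, augmented with prev/next deque pointers."""

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = None
        self.next = None


class LinkedDeque:
    """Deque with front/rear pointers; every operation is O(1)."""

    def __init__(self):
        self.front = None
        self.rear = None

    def enqueue_last(self, node):
        node.prev, node.next = self.rear, None
        if self.rear is None:
            self.front = node
        else:
            self.rear.next = node
        self.rear = node

    def enqueue_first(self, node):
        node.prev, node.next = None, self.front
        if self.front is None:
            self.rear = node
        else:
            self.front.prev = node
        self.front = node

    def unlink(self, node):
        """Remove node from anywhere in the deque by relinking its neighbors."""
        if node.prev is None:
            self.front = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.rear = node.prev
        else:
            node.next.prev = node.prev
        node.prev = node.next = None

    def dequeue_front(self):
        node = self.front
        if node is not None:
            self.unlink(node)
        return node
```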

The "inner" cuckoo hash structure consists of two tables, R_0 and R_1, each having m^{2/3} cells, together
with two random hash functions, f_0 and f_1, as well as a small list, L, which is used to implement two
deques that we call InnerQueue and InnerStash respectively. Each item x = (k, v) that is stored in Q will
be located in cell R_0[f_0(k)] or in cell R_1[f_1(k)], or it will be in the list L. The only manner in which our
two constructions differ is in the implementation of L.

In summary, our dictionary data structure consists simply of the "outer" cuckoo table T with a queue 
and stash, and the "inner" cuckoo table Q, which contains its own queue and stash. 

2.2 Operations on the Primary Structure 

To perform a lookup, that is, a get(k), we try each of the cells T_0[h_0(k)], T_1[h_1(k)], R_0[f_0(k)], and
R_1[f_1(k)], and perform a lookup in L, until we either locate an item, x = (k, v), or we determine that
there is no item in these locations with key equal to k, in which case we conclude that there is no such item
in S.
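The lookup procedure can be sketched as follows. This is an illustrative sketch, not the authors' implementation: tables are Python lists whose cells hold None or a (key, value) tuple, the hash functions are passed in as plain callables, and L is scanned linearly as in the first (Practical RAM) construction.

```python
def get(key, T, h, R, f, L):
    """Probe T_0, T_1, R_0, R_1 in turn, then search the small list L.

    T and R are pairs of tables, h and f are the matching pairs of hash
    functions, and L is an iterable of (key, value) items.
    """
    probes = ((T[0], h[0]), (T[1], h[1]), (R[0], f[0]), (R[1], f[1]))
    for table, hash_fn in probes:
        item = table[hash_fn(key) % len(table)]
        if item is not None and item[0] == key:
            return item
    for item in L:                 # L is small, so this scan stays cheap
        if item[0] == key:
            return item
    return None                    # no item with this key is in S
```

Exactly four table cells are inspected regardless of n, so the cost of a lookup is dominated by the search in L.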

Likewise, to perform a remove(k) operation, we first perform a get(k) operation. If an item x = (k, v)
is found in one of the locations T_0[h_0(k)] or T_1[h_1(k)], then we simply remove this item from this cell. If
such an x is found in R_0[f_0(k)], or R_1[f_1(k)], or in the structure L, then we remove x from this cell and we
remove x from its deque(s) by updating the prev and next pointers for x's neighbors so that they now point
to each other.

Performing an add(k, v) operation is somewhat more involved. We begin by performing an enqueueLast(x,
0, OuterQueue) operation, which adds x = (k, v) at the end of the deque OuterQueue, together with the
bit 0 to indicate that x should next be inserted in T_0. Then, for constants α, α' > 1, set in the analysis,
we perform α (outer) insertion substeps, followed by α' (inner) insertion substeps. Each outer insertion
substep begins by performing a dequeueFront(OuterQueue) operation to remove the pair ((k, v), b) at the
front of the deque. If the cell T_b[h_b(k)] is empty, then we add (k, v) to this cell and this ends this substep.
Otherwise, we evict the current item, y, in the cell T_b[h_b(k)], replacing it with (k, v). If this insertion
substep just created a second cycle in the insertion process (which we can detect by a simple marking
scheme, using 0(log^ n) space with overwhelming probability),^2 then we complete the insertion substep
by performing an enqueueLast(y, b', OuterStash) operation, where b' = (b + 1) mod 2. If this insertion
substep has not created a second cycle, however, then we complete the insertion substep by performing an
enqueueFirst(y, b', OuterQueue) operation, where b' = (b + 1) mod 2, which adds the pair (y, b') to the
front of the primary structure's queue. Thus, by design, an add operation takes O(α) = O(1) time in the
worst case, assuming the operations on Q run in O(1) time and succeed.
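The bounded-work insertion described above can be sketched as follows. This is a simplified illustration under several assumptions that are ours, not the paper's: the class name DeamortizedCuckoo and the parameters alpha and max_kicks are hypothetical, a salted Python hash stands in for the random hash functions, a per-item kick counter stands in for the cycle-detection mechanism, the queue and stash are plain deques scanned linearly rather than the inner cuckoo table Q, the inner insertion substeps are omitted, and keys are assumed distinct.

```python
from collections import deque
import random


class DeamortizedCuckoo:
    """Two-table cuckoo hashing where each add() performs at most alpha moves."""

    def __init__(self, m, alpha=4, max_kicks=64, seed=0):
        rng = random.Random(seed)
        self.m = m
        self.alpha = alpha
        self.max_kicks = max_kicks               # stand-in for cycle detection
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))
        self.tables = ([None] * m, [None] * m)   # T_0 and T_1
        self.queue = deque()                     # OuterQueue: (key, value, b, kicks)
        self.stash = deque()                     # OuterStash

    def _h(self, b, key):
        return hash((self.salts[b], key)) % self.m

    def add(self, key, value):
        self.queue.append((key, value, 0, 0))    # enqueueLast with bit 0
        for _ in range(self.alpha):              # alpha outer insertion substeps
            if not self.queue:
                break
            k, v, b, kicks = self.queue.popleft()        # dequeueFront
            cell = self._h(b, k)
            evicted = self.tables[b][cell]
            self.tables[b][cell] = (k, v)
            if evicted is None:
                continue                         # empty cell: this substep is done
            ek, ev = evicted
            if kicks + 1 >= self.max_kicks:
                self.stash.append((ek, ev, 1 - b, 0))    # enqueueLast on the stash
            else:
                self.queue.appendleft((ek, ev, 1 - b, kicks + 1))  # enqueueFirst

    def get(self, key):
        for b in (0, 1):
            item = self.tables[b][self._h(b, key)]
            if item is not None and item[0] == key:
                return item[1]
        # Linear scan for simplicity; the paper looks these up in Q instead.
        for k, v, *_ in list(self.queue) + list(self.stash):
            if k == key:
                return v
        return None
```

Each call to add touches at most alpha cells, so its cost is bounded in the worst case; items that have not yet found a cell wait in the queue, where lookups can still find them.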

Finally, similar to [4], every m^{1/3} operations we try to insert an element from the stash by performing a
dequeueFront(OuterStash) operation to remove the pair ((k, v), b) at the front of the deque and then spending
2 moves trying to insert (k, v). If no free slot is found, we return the current element from the connected
component of (k, v) to the front of the stash. This ensures that items belonging to connected components
that have had cycles removed via deletions do not remain in the stash long after the deletions occur. All 
constants mentioned throughout are chosen for theoretical convenience; no effort has been made to optimize 
them. 

2.3 Operations on the Auxiliary Structure 

As mentioned above, the auxiliary structure, Q, is a standard cuckoo table of size m' = m^{2/3} augmented
with its own (inner) queue and stash maintained via the structure L, and pointers to give Q the functionality
of two double-ended queues, called OuterQueue and OuterStash. The enqueue and dequeue operations to
OuterQueue and OuterStash, therefore, involve standard O(1)-time pointer updates to maintain the deque
property, plus insertion and deletion algorithms for the inner cuckoo table. Our inner insertion algorithm
is different from our outer one in that we do not immediately place items on the back of the inner queue.
Instead, on an insertion of item (k, v), we spend 16 steps trying to insert (k, v) into the inner cuckoo table



^2 [5] refers to this as a cycle-detection mechanism, and notes there are many possible instantiations. See also [25].



immediately. If the insertion does not complete in 16 steps, we place the current element from the connected 
component of (k, v) in the back of InnerQueue. The reason for this policy is that, as we will show, with
overwhelming probability almost every item in the inner table can be inserted in 16 steps, due to the extreme 
sparsity of the table. Finally, every m^{1/3} "inner" operations we additionally spend 16 moves trying to insert
an element from the front of InnerQueue and a single move trying to insert an element from the front of 
InnerStash, returning elements to the back of InnerQueue or InnerStash in the event that a vacant slot has 
not been found. The purpose of this is to ensure that items whose connected components have shrunk due 
to deletions do not remain in the inner queue or stash unnecessarily. 
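The "try immediately, then defer" policy of the inner insertion can be sketched as below. This is an illustrative sketch only: the function name inner_insert and its parameters are hypothetical, tables are Python lists whose cells hold None or a (key, value) tuple, and the periodic retries from InnerQueue and InnerStash are not shown.

```python
from collections import deque


def inner_insert(tables, hashes, inner_queue, key, value, budget=16):
    """Spend at most `budget` moves placing (key, value) in the inner table.

    Returns True if a vacant cell was found; otherwise the item left in hand
    is appended to the back of inner_queue together with the table bit it
    should try next.
    """
    item, b = (key, value), 0
    for _ in range(budget):
        cell = hashes[b](item[0]) % len(tables[b])
        occupant = tables[b][cell]
        tables[b][cell] = item          # place the item, evicting any occupant
        if occupant is None:
            return True                 # vacant cell found within the budget
        item, b = occupant, 1 - b       # the evicted item tries its other table
    inner_queue.append((item, b))       # budget exhausted: defer to InnerQueue
    return False


# Tiny demonstration: two one-cell tables can hold only two items.
tables = ([None], [None])
hashes = (lambda k: 0, lambda k: 0)
inner_queue = deque()
placed = [inner_insert(tables, hashes, inner_queue, k, str(k)) for k in (1, 2, 3)]
```

Because the inner table is very sparse, the deferral branch is expected to be rare; the demonstration forces it by making the tables deliberately too small.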

The only manner in which our two constructions differ is in the implementation of L, which supports 
the inner stash and inner queue. In our first construction, for the Practical RAM model, L is simply two 
double-ended queues (one for InnerQueue, and one for InnerStash), and when performing a lookup in L, 
we simply look at all the cells in both deques. In our second construction, we implement L using a data 
structure we call the atomic stash. This structure maintains a small dictionary, of size at most 0(log^' ^ n),
so as to support constant-time worst-case insertions, deletions, and lookups, using 0(log^'^ n) space. Thus,
in our second construction, the structure L is simultaneously two deques and an atomic stash to enable
fast lookups into L. Details are in Appendix A.

3 Analysis 

The critical insights into why such a standard approach is sufficient to support fully de-amortized constant- 
time updates and lookups for the auxiliary structure, Q, are based on enhancing the analysis of previous 
work for cuckoo hash tables. 

Definition 1: A sequence π of insert, delete, and lookup operations is n-bounded if at any point in time
during the execution of π the data structure contains at most n elements.

We prove the following two theorems. 

Theorem 1: Let C be the inner cuckoo hash table consisting of two tables of size m^{2/3}, together with a
queue/stash L as described above. For any polynomial p(n), any m^{1/3}-bounded sequence of operations π'
on C of length p(n), and any s < m^{1/3}, every insert into C completes in O(1) steps, and L has size at most
s, with probability at least 1 - p(n)/m^{Ω(s)}.

Theorem 1 states that with overwhelming probability, all insertions performed on the auxiliary structure
Q will run in O(1) time, and all of the inner table's internal data structures will stay small (ensuring fast
lookups), provided we never try to put more than m^{1/3} items in Q. The following theorem addresses this
size condition.

Theorem 2: Let T be a cuckoo hashing scheme consisting of two tables of size m. Let m > (1 + ε)n, for
some constant ε > 0, and let Q be the queue/stash as described above. For any polynomial p(n) and any
n-bounded sequence of operations π of length p(n), the number of items stored in Q will never be more
than 2 log^ n, with probability at least 1 - 1/n^{Ω(log n)}.

3.1 Notation and Preliminary Lemmata 

A standard tool in the analysis of cuckoo hashing is the cuckoo graph G. Specifically, given a cuckoo hash
table T, subdivided into two tables T_0 and T_1, each with m cells, and hash functions h_0 and h_1, the cuckoo
graph for T is a bipartite multigraph (without self-loops), in which left vertices represent the table cells in
T_0 and right vertices represent the cells in T_1. Each key x inserted into the hash table corresponds to an
edge connecting T_0[h_0(x)] to T_1[h_1(x)]. Thus, if S denotes the set of n items in T, and each of T_0 and T_1
has m cells, the cuckoo graph G(S, h_0, h_1) contains 2m nodes (m on each side) and n edges. Given a
cuckoo graph G(S, h_0, h_1), we use C_{S,h_0,h_1}(v) to denote the number of edges in the connected component
that contains v.
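As a concrete illustration of the cuckoo graph, the sketch below computes, for every vertex v, the number of edges in the connected component containing v, using a union-find structure. The function name and the vertex numbering (left vertex i is i, right vertex j is m + j) are our own conventions, and the hash functions are passed as ordinary callables returning cell indices in [0, m).

```python
def component_edge_counts(keys, h0, h1, m):
    """Return a list whose entry v is the edge count of v's component.

    Left vertex i (cell T_0[i]) is numbered i; right vertex j (cell T_1[j])
    is numbered m + j. Each key contributes one edge (h0(key), m + h1(key)).
    """
    parent = list(range(2 * m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for k in keys:                           # merge the endpoints of each edge
        parent[find(h0(k))] = find(m + h1(k))

    edges = {}
    for k in keys:                           # count edges per component root
        root = find(h0(k))
        edges[root] = edges.get(root, 0) + 1
    return [edges.get(find(v), 0) for v in range(2 * m)]
```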

In our analysis, there will actually be two cuckoo graphs G and G'. G corresponds to the outer table,
and has vertex set [m] x [m] and at most n edges at any point in time. G' corresponds to the inner table
and has vertex set [m'] x [m'] for m' = m^{2/3}, and (we will show) n' < m^{1/3} edges with overwhelming
probability at any point in time.

We will make frequent use of the following lemmata. 

Lemma 3: Let G(S, h_0, h_1) be a cuckoo graph with n edges and vertex set [m] x [m]. Then

\[ \Pr\bigl(C_{S,h_0,h_1}(v) > k\bigr) < \left(\frac{en}{m}\right)^{k}, \]

where the probability is over the choice of h_0 and h_1.

Proof: By standard arguments, (e.g. |17, Lemma 1]), PT{Cs,ho,hi{v) > k) < Pr(Bin(nA;, 1/m) > k). The 
lemma follows by an easy calculation. ■ 
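The "easy calculation" is not spelled out in the text. One standard route, which we assume here, is $\Pr(\mathrm{Bin}(N, p) \ge k) \le \binom{N}{k} p^k \le (Np)^k / k!$ with $N = nk$ and $p = 1/m$. The sketch below checks this inequality numerically for a small instance; the function names and parameter values are illustrative only.

```python
from math import comb, factorial

def binomial_tail(trials, p, k):
    """Exact Pr(Bin(trials, p) >= k)."""
    return sum(comb(trials, j) * p**j * (1 - p) ** (trials - j)
               for j in range(k, trials + 1))

def factorial_bound(n, m, k):
    """The closed-form upper bound (n*k/m)^k / k!."""
    return (n * k / m) ** k / factorial(k)

# Small illustrative instance: n = 50 keys, m = 100 cells per side.
for k in range(1, 7):
    assert binomial_tail(50 * k, 1 / 100, k) <= factorial_bound(50, 100, k)
```

The bound is only informative when $nk/m$ is small relative to $k$, which is the sparse regime the inner table is designed to operate in.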

We now present some basic facts about the stash. The first relates the time it takes to determine whether 
an element needs to be put in the stash on an insertion. 

Lemma 4: [4, Claim 5.4] For any element x and any cuckoo graph $G(S, h_0, h_1)$, the number of moves
required before x is either placed in the stash or inserted into the cuckoo table is at most $2C_{S,h_0,h_1}(x)$.

[25, Lemma 2.2] shows that the number of keys that must be stored in the stash corresponds to the
quantity $e(G(S, h_0, h_1))$, defined as follows. For a connected component H of $G(S, h_0, h_1)$, define the
excess $e(H) := \#\mathrm{edges}(H) - \#\mathrm{nodes}(H)$, and define

$$e(G(S, h_0, h_1)) := \sum_{H} \max(e(H), 0),$$

where the sum is over all connected components H of $G(S, h_0, h_1)$.

Given a vertex v in this random graph, recall that $C_{S,h_0,h_1}(v)$ denotes the number of edges in the connected
component that contains v, and let $B_v$ be the number of edges in the component of v that need to be removed
to make it acyclic ($B_v$ is also called the cyclomatic number of the component containing v). Notice that
$B_v = e(H) + 1$, that is, $B_v$ is 1 more than the number of keys from v's component that need to be placed in
the stash.
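To make the excess concrete, the sketch below computes $\sum_H \max(e(H), 0)$ for a given edge list, i.e. the number of keys that must be stashed according to the definition above. The function name `stash_size` and the vertex numbering are assumptions made for illustration.

```python
from collections import Counter

def stash_size(m, edges):
    """Sum over components H of max(#edges(H) - #nodes(H), 0).

    edges[i] = (h0(x_i), h1(x_i)); left cells are numbered 0..m-1 and
    right cells m..2m-1 (a convention of this sketch).
    """
    parent = list(range(2 * m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(m + b)

    edge_count = Counter(find(a) for a, _ in edges)
    node_count = Counter(find(v) for v in range(2 * m))
    # Components without edges have negative excess and contribute nothing.
    return sum(max(edge_count[r] - node_count[r], 0) for r in edge_count)
```

Three keys that all hash to the same pair of cells form a component with 3 edges and 2 nodes, so exactly one of them must be stashed; a tree or a single cycle needs no stash at all.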

We use the following lemma from [25].

Lemma 5: Let $|S| = n$ with $(1+\epsilon)n \le m$ for some constant $\epsilon$, and let $G(S, h_0, h_1)$ be a cuckoo graph
with vertex set $[m] \times [m]$. Then

$$\Pr(B_v \ge t \mid C_{S,h_0,h_1}(v) = k) \le \left(\frac{3e^5k^3}{m}\right)^{t}.$$



3.2 Proving Theorem 1

Throughout this section, we use the following notation, following that in [5]. Let $\pi'$ be an $m^{1/3}$-bounded
sequence of $p(n)$ operations. Denote by $(x_1, \ldots, x_{p(n)})$ the elements inserted by $\pi'$. For any integer
$0 \le i \le p(n) - m^{1/3}$, let $S_i$ denote the set of elements that are stored in the data structure just before the
insertion of $x_i$, and let $\hat{S}_i$ denote $S_i$ together with the elements $\{x_i, x_{i+1}, \ldots, x_{i+m^{1/3}}\}$, ignoring any
deletions between time $i$ and time $i + m^{1/3}$. Since $\pi'$ is an $m^{1/3}$-bounded sequence, we have $|S_i| \le m^{1/3}$
and $|\hat{S}_i| \le 2m^{1/3}$ for all i.

We need the following lemma, which states that all components of G' are tiny. 



Lemma 6: Let $|S| = n'$, and let $G'(S, f_0, f_1)$ be any cuckoo graph with vertex set $[m'] \times [m']$. Assume
$n'/m' \le 2m^{-1/3}$. Then for any $k \le m^{1/3}$, and for any node v, $\Pr[C_{S,f_0,f_1}(v) \ge k] \le m^{-\Omega(k)}$.

Proof: Lemma 3 implies that

$$\Pr(C_{S,f_0,f_1}(v) \ge k) \le \left(\frac{2k}{m^{1/3}}\right)^{k} \frac{1}{k!} \le m^{-\Omega(k)}.$$

For $k \le m^{1/3}$ the probability that $C_{S,f_0,f_1}(v) \ge k$ is at most $m^{-\Omega(k)}$; applying a union bound over all v,
the probability some connected component has size greater than k is still at most $m^{-\Omega(k)}$.



3.2.1 Showing InnerStash Stays Small 

We begin by showing that InnerStash has size at most s at all points in time with probability at least
$1 - 1/m^{\Omega(s)}$. This requires extending the analysis of prior works [25, 27] to superconstant sized stashes, while
leveraging the sparsity of the inner cuckoo table C.

Lemma 7: Let $|S| = n'$, and let $G'(S, f_0, f_1)$ be any cuckoo graph with vertex set $[m'] \times [m']$. Assume
$n' \le 2m^{1/3}$ and $m' = m^{2/3}$. For any $s \le m^{1/6}$, with probability $1 - 1/m^{\Omega(s)}$ the total number of vertices that
reside in the stash of $G'(S, f_0, f_1)$ is at most s.

The proof is entirely the same as that for the outer stash, which we prove in Lemma 12. We defer the
proof until then.

We are in a position to show formally that the inner stash is small at all points in time. 

Lemma 8: Let $\pi'$ be any $m^{1/3}$-bounded sequence of operations of length $p(n)$ on the inner cuckoo table C.
For any $s \le m^{1/6}$, with probability $1 - p(n)/m^{\Omega(s)}$ over the choice of hash functions $f_0, f_1$, InnerStash
has size at most s at all points in time.

Proof: Following the approach of [5], we define a good event $\xi_1$ that ensures that InnerStash has size at
most s. Namely, let $\xi_1$ be the event that at all points i in time, the number of vertices v in the stash of
$G'(\hat{S}_i, f_0, f_1)$ is at most s. By Lemma 7 and a union bound over all $p(n)$ operations in $\pi'$, $\xi_1$ occurs with
probability at least $1 - p(n)/m^{\Omega(s)}$.

We distinguish between the real stash and the effective stash. The real stash at any point i in time is the
set of nodes that reside in the stash of $G'(\hat{S}_i, f_0, f_1)$; the effective stash refers to the real stash, together with
elements that used to be in the real stash but have since had cycles removed from their components due to
deletions, and have not yet been inserted in the inner cuckoo table. Let $E_i$ denote the set of items in the
effective stash but not the real stash of $G'(\hat{S}_i, f_0, f_1)$.

Clearly the event $\xi_1$ guarantees that at any point in time, the size of the real stash is at most s. To see that
the size of the effective stash never exceeds s, assume by way of induction that for $j \ge m^{1/3}$, the effective
stash has size at most s at all times less than j (clearly event $\xi_1$ guarantees that this is true for all $j \le m^{1/3}$
as a base case). In particular, the inductive hypothesis ensures that the effective stash has size at most s at
time $j - m^{1/3}$. Observe that during the execution of operations $\{x_{j-m^{1/3}}, x_{j-m^{1/3}+1}, \ldots, x_j\}$, we spend
$m^{1/3} \ge s^2$ moves in total on the elements of the effective stash. By Lemma 6 and a union bound over all
$i \le p(n)$, we may assume all connected components of $G'(\hat{S}_j, f_0, f_1)$ have size at most s/2; this adds a
failure probability of at most $p(n)/m^{\Omega(s)}$ to the result. Thus, by Lemma 4 the insertion of any item x in the
effective stash never requires more than s moves before it succeeds or causes x to be returned to the back
of the stash. It follows that by time j, we have spent at least s moves on each of the items in $E_{j-m^{1/3}}$, and
hence all elements in $E_{j-m^{1/3}}$ are inserted into the table by time j. Thus, at time j the size of the effective
stash is at most s, and this completes the induction.



3.2.2 Showing InnerQueue Stays Small 

The bulk of our analysis relies on the following technical lemma. 

Lemma 9: Let $|S| = n'$, and let $G'(S, f_0, f_1)$ be any cuckoo graph with vertex set $[m'] \times [m']$. Assume
$n'/m' \le 2m^{-1/3}$. For any $0 < s \le m^{1/3}$, with probability $1 - 1/m^{\Omega(\sqrt{s})}$ the total number of vertices
$v \in G'(S, f_0, f_1)$ with $C_{S,f_0,f_1}(v) > 8$ is at most s.

Proof: Lemma 6 implies that the probability that all connected components have size at most $\sqrt{s}$ is at least
$1 - m^{-\Omega(\sqrt{s})}$. We assume for the remainder that there are no components of size greater than $\sqrt{s}$; this adds
at most a probability $1/m^{\Omega(\sqrt{s})}$ of failure in the statement of the lemma.

Pick a special vertex $v_i$ in each component $C_i$ of size greater than 8 at random, and assign to $v_i$ a
"weight" equal to the number of edges in $C_i$. We show with probability at least $1 - m^{-\Omega(\sqrt{s})}$ we cannot find
a set of weight s; the lemma follows.

Since no component has size more than $\sqrt{s}$, we would need to find at least $j = s/\sqrt{s} = \sqrt{s}$ vertices that
are in components of size greater than 8 and are the special vertex for that component. We use the fact that
for any fixed set of j distinct vertices $v_1, \ldots, v_j$, the probability (over both the choice of G' and the choice
of the special vertex for each component) that all j vertices are the special vertices for their component is
upper bounded by $\Pr[C_{S,f_0,f_1}(v) > 8]^j$. Indeed, write $E_i$ for the event that $v_1, \ldots, v_{i-1}$ are all special. We may
write

$$\Pr[v_1, \ldots, v_j \text{ are all special}] \le \prod_{i=1}^{j} \Pr[v_i \text{ is special} \mid E_i].$$

To bound the right hand side, notice that

$$\Pr[v_i \text{ is special} \mid E_i] \le \Pr[v_i \text{ is in a different component from } v_1, \ldots, v_{i-1} \text{ and } C_{S,f_0,f_1}(v_i) > 8 \mid E_i] \le \Pr[C_{S,f_0,f_1}(v) > 8],$$

where the last inequality holds because the density of edges after taking out the earlier (larger than
expected) components is less than the density of edges a priori.

Thus, taking a union bound over every set $\{v_1, \ldots, v_j\}$ of $j = \sqrt{s}$ vertices, the probability that we can
find such a set is at most

$$\sum_{\{v_1, \ldots, v_j\}} \Pr[v_1, \ldots, v_j \text{ are all special}] \le \binom{2m'}{\sqrt{s}} \left(\Pr[C_{S,f_0,f_1}(v) > 8]\right)^{\sqrt{s}} \le \binom{2m'}{\sqrt{s}} \left(\frac{16^8}{8!\, m^{8/3}}\right)^{\sqrt{s}} \le m^{-\Omega(\sqrt{s})},$$

where the second inequality follows by Lemma 3. ■

We are in a position to show formally that the inner queue is small at all points in time. 

Lemma 10: Let $\pi'$ be any $m^{1/3}$-bounded sequence of operations of length $p(n)$ on the inner cuckoo table C.
For any $s \le m^{1/3}$, with probability $1 - p(n)/m^{\Omega(\sqrt{s})}$ over the choice of hash functions $f_0, f_1$, InnerQueue
has size at most s at all points in time.

Proof: Following the approach of [5], we define a good event $\xi_2$ that ensures that InnerQueue has size
at most s. Namely, let $\xi_2$ be the event that at all points i in time, the number of vertices $v \in G'$ with
$C_{\hat{S}_i,f_0,f_1}(v) > 8$ is at most s. By Lemma 9 and a union bound over all $p(n)$ operations in $\pi'$, $\xi_2$ occurs with
probability at least $1 - p(n)/m^{\Omega(\sqrt{s})}$.

We distinguish between the real queue and the effective queue. The real queue at any point i in time
is the set of nodes v such that $C_{\hat{S}_i,f_0,f_1}(v) > 8$; the effective queue refers to the real queue, together with
elements that used to be in the real queue but have since had their components shrunk to size at most 8 via
deletions, and have not yet been inserted in the inner cuckoo table. Let $E_i$ denote the set of items in the
effective queue but not the real queue at time i.

Clearly the event $\xi_2$ guarantees that at any point in time, the size of the real queue is at most s. To see
that in fact the size of the effective queue never exceeds s, assume by way of induction that for $j \ge m^{1/3}$,
the effective queue has size at most s at all times less than j (clearly event $\xi_2$ guarantees that this is true for
all $j \le m^{1/3}$ as a base case). In particular, the inductive hypothesis ensures that the effective queue has size
at most s at time $j - m^{1/3}$. Observe that during the execution of operations $\{x_{j-m^{1/3}}, x_{j-m^{1/3}+1}, \ldots, x_j\}$,
we spend $16m^{1/3} \ge 16s$ moves on the elements of the effective queue, with at least 16 moves devoted to
each element. By definition of the real queue, combined with Lemma 4, at most 16 moves are required to
insert each element of $E_{j-m^{1/3}}$, and thus all elements in $E_{j-m^{1/3}}$ are inserted into the table by time j. Thus,
at time j the size of the effective queue is at most s, and this completes the induction. ■



3.2.3 Putting it together 

By design, every insert into C terminates in O(1) steps. Combining Lemmata 8 and 10, with probability
$1 - p(n)/m^{\Omega(\sqrt{s})}$, both InnerQueue and InnerStash have size at most s/2 at all times. Since L only contains
items from the inner queue and inner stash, it follows that L never contains more than s items. This proves
Theorem 1.

3.3 Proving Theorem 2



Throughout this section, we use the following notation, following that in Section 3.2. Let $\pi$ be an
n-bounded sequence of $p(n)$ operations. Denote by $(x_1, \ldots, x_{p(n)})$ the elements inserted by $\pi$. For any
integer $0 \le i \le p(n)$, let $S_i$ denote the set of elements that are stored in the data structure just before the
insertion of $x_i$, let $\hat{S}_i$ denote $S_i$ together with the elements $\{x_i, x_{i+1}, \ldots, x_{i+\log^6 n}\}$, ignoring any deletions
between time $i$ and time $i + \log^6 n$, and let $\bar{S}_i$ denote $S_i$ together with the elements $\{x_i, x_{i+1}, \ldots, x_{i+m^{1/2}}\}$,
ignoring any deletions between time $i$ and time $i + m^{1/2}$ (treating any operations past time $p(n)$ as empty).
Since $\pi$ is an n-bounded sequence, we have $|S_i| \le n$, $|\hat{S}_i| \le n + \log^6 n$, and $|\bar{S}_i| \le n + m^{1/2}$ for all i.

In proving Theorem 2, it suffices to show that neither the stash nor the queue of the primary structure
will grow too large. We will use the following lemma.

Lemma 11: Let $|S| = n$, and let $G(S, h_0, h_1)$ be a cuckoo graph with vertex set $V = [m] \times [m]$, with
$m \ge (1+\epsilon)n$ for some constant $\epsilon > 0$. There exists a constant $c_2 > 0$ such that with probability $1 - 1/n^{\Omega(\log n)}$
over the choice of $h_0$ and $h_1$, all components of $G(S, h_0, h_1)$ are of size at most $c_2 \log^2 n$.

Proof: It is a standard calculation that for any node v in the cuckoo graph, $\Pr[C_{S,h_0,h_1}(v) \ge k] \le \beta^k$ for
some constant $\beta \in (0, 1)$ (see e.g. [25, Lemma 2.4]). The conclusion follows by setting $k = O(\log^2 n)$, and
applying the union bound over all vertices.



3.3.1 Showing OuterStash Stays Small 



Lemma 12: Let $|S| = n$, and let $G(S, h_0, h_1)$ be any cuckoo graph with vertex set $V = [m] \times [m]$, where
$(1+\epsilon)n \le m$ for some constant $\epsilon > 0$. For any s such that $s \le m^{1/4}$, with probability $1 - 1/m^{\Omega(s)}$ the total
number of vertices that reside in the stash of $G(S, h_0, h_1)$ is at most s.

The choice of $s \le m^{1/4}$ is fairly arbitrary but convenient and sufficient for our purposes. Notice Lemma 12
readily implies Lemma 7.



Proof: Here we follow the work of [22], which considers the analysis of super-constant sized stashes,
extending the work of [25]. The starting point is Lemmata 3 and 5 above. Specifically, in [22] it is shown
that these lemmata imply that

$$\Pr(B_v \ge t) \le \sum_{k=1}^{\infty} \min\left(\left(\frac{3e^5k^3}{m}\right)^{t}, 1\right) \beta^k$$

for some constant /3. In particular, we can concern ourselves with values of k that are 0{m^/^), since the 
summation over /J*^ terms for larger values of k is dominated by 2^^^'" ^' = m~^^™' ' . 

It follows that Pr(St, > t) is at most max ( ?n,^^(*),m^^(™ ) j. We therefore claim that Pr(i?„ > 

j + 1) < m^^~"'^ for some constant a for j < m^/®. 

Now, following the derivation of Theorem 2.2 of p5| , we have the probability that the stash exceeds 
size s is given by the probability that 2m independently chosen components have an excess of more than s 
edges, which can be bounded as: 



gk^^as-k 



Fr{eig)>s) < T. {^k) 



k=l 

= m-^(^). 

Here the second to last line follows from a straightforward optimization to find the maximum of $(x/k)^k$
(which occurs at $k = x/e$). ■

We now show formally that the outer stash is small at all points in time. 

Lemma 13: Let $\pi$ be any n-bounded sequence of operations of length $p(n)$. For $s \le m^{1/4}$, with probability
$1 - p(n)/m^{\Omega(s)}$ over the choice of hash functions $h_0, h_1$, OuterStash has size at most s at all points in time.

Proof: We define a good event $\xi_3$ that ensures that OuterStash has size at most s. Namely, let $\xi_3$ be the event
that at all points i in time, the number of vertices in the stash of $G(\bar{S}_i, h_0, h_1)$ is at most s, and additionally all
connected components of $G(\bar{S}_i, h_0, h_1)$ have size at most $\log^2 n$. By Lemmata 11 and 12, as well as a union
bound over all $p(n)$ operations in $\pi$, $\xi_3$ occurs with probability at least $1 - p(n)/m^{\Omega(s)}$. As in [5], a minor
technical point in applying Lemma 12 is that $\bar{S}_i$ is $(n + m^{1/2})$-bounded, not n-bounded. But we can handle
this by applying Lemma 12 with $\epsilon' = \epsilon/2$, since for large enough m, $(1 + \epsilon/2)(n + m^{1/2}) \le (1 + \epsilon)n \le m$.

We distinguish between the real stash and the effective stash. The real stash at any point i in time is
the set of nodes $v \in V$ that reside in the stash of $G(\bar{S}_i, h_0, h_1)$; the effective stash refers to the real
stash, together with elements that used to be in the real stash but have since had cycles removed from their
components due to deletions, and have not yet been inserted in the outer cuckoo table. Let $E_i$ denote the set
of items in the effective stash but not the real stash of $G(\bar{S}_i, h_0, h_1)$.

Clearly the event $\xi_3$ guarantees that at any point in time, the size of the real stash is at most s. To see that
the size of the effective stash never exceeds s, assume by way of induction that for $j \ge m^{1/2}$, the effective
stash has size at most s at all times less than j (clearly event $\xi_3$ guarantees that this is true for all $j \le m^{1/2}$ as
a base case). In particular, the inductive hypothesis ensures that the effective stash has size at most s at time
$j - m^{1/2}$. Observe that during the execution of operations $\{x_{j-m^{1/2}}, x_{j-m^{1/2}+1}, \ldots, x_j\}$, we spend at least
^/2 i^/A _ ,^1/4 moves in total on the elements of the effective stash. Since all connected components 
have size at most $\log^2 n$, Lemma 4 implies the insertion of any item x in the effective stash never requires
more than $2\log^2 n$ moves before it succeeds or causes x to be returned to the back of the stash. Thus, all
elements in the effective stash at time $j - m^{1/2}$ require at most $2m^{1/4}\log^2 n \le m^{1/2}$ operations in total to
process, and it follows that all elements in $E_{j-m^{1/2}}$ are inserted into the table by time j. Thus, at time j the
size of the effective stash is at most s, and this completes the induction. ■



3.3.2 Showing OuterQueue Stays Small 

Let $|S| = n$, and let $G(S, h_0, h_1)$ be any cuckoo graph with vertex set $[m] \times [m]$. It is well-known that, for
any node v, there is significant probability that an insertion of v into G takes $\Omega(\log n)$ time. But one might
hope that for a sufficiently large set of distinct vertices $\{v_1, \ldots, v_N\}$, the average size of the connected
components of the $v_i$'s is constant with overwhelming probability over choice of G, and thus any sequence
of N insertions will take O(N) time in total. Indeed, Lemma 4.4 of Arbitman et al. [5] establishes that, for
any distinct vertices $\{v_1, \ldots, v_N\}$ with $N \le \log n$, $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) = O(N)$ with probability $1 - 2^{-\Omega(N)}$
over the choice of $h_0$ and $h_1$. Roughly speaking, Arbitman et al. use this result to conclude that a logarithmic
sized queue suffices for deamortizing cuckoo hashing (where all guarantees hold with high probability),
since any sequence of log n insertions can be processed in O(log n) time steps.

Our goal is to achieve guarantees which hold with overwhelming probability. To achieve this, we use
the fact that we can afford to keep a queue of super-logarithmic size without affecting the asymptotic space
usage of our algorithm. We show that any sequence of, say, $N = \log^6 n$ operations can be cleared from
the queue in O(N) time with probability $1 - 1/n^{\Omega(\log n)}$. It follows that with overwhelming probability the
queue does not overflow. Unfortunately, the techniques of [5] do not generalize to values of N larger than
O(log n), so we generalize their result using different methods.

Intuitively, one should picture the random process we wish to analyze as a standard queueing process,
where the time to handle each job is a random variable. In our case, the random variable for a job (which
is a key k to be placed) is the time to find a spot for k in the cuckoo table, which is proportional to the
size of the connected component in which k is placed in the cuckoo graph by Lemma 4. This random
variable is known to be constant on average and have exponentially decreasing tails, so if the job times were
independent, this would be a normal queue (more specifically, a Galton-Watson process), and the bounds
would follow from standard analyses.

Unfortunately, the job times in our setting are not independent. Roughly speaking, our analysis proceeds 
by showing that with overwhelming probability on a given instance of the cuckoo graph, the expected value 
of the size of the connected component of a randomly chosen vertex is close to its expectation if the graph 
was chosen randomly. 

Our main technical tool will be the following lemma. 

Lemma 14: Let $|S| = n$, and let $G(S, h_0, h_1)$ be a cuckoo graph with vertex set $V = [m] \times [m]$. Let
$\{v_1, \ldots, v_N\}$ be a set of $N \ge \log^6(n)$ vertices chosen uniformly at random (with replacement) from the
cuckoo graph $G(S, h_0, h_1)$. There is a constant c such that with probability $1 - n^{-\Omega(\log n)}$ (over both the
choice of the $v_i$'s and the generation of the cuckoo graph G), $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) \le cN$.

Proof: Let $X_i$ be the number of vertices in $G(S, h_0, h_1)$ in components of size i, let $\mu_i = E[X_i]$ be the
expected number of vertices in components of size i in a random cuckoo graph, and let $\mu = \frac{1}{2m}\sum_{i=1}^{2m} i\mu_i$
be the expected size of the connected component of a random node in a random cuckoo graph. By standard
calculations (see e.g. [25, Lemma 2.4]), there is a constant $\beta \in (0, 1)$ such that for any v,
$E[C_{S,h_0,h_1}(v)] \le \sum_{k=1}^{\infty} \beta^k = O(1)$, where the expectation is taken over choice of $h_0$ and $h_1$, and hence $\mu = O(1)$.

For fixed hash functions $h_0$ and $h_1$, we can write

$$E_{v \in V}[C_{S,h_0,h_1}(v)] = \frac{1}{2m}\sum_{v \in V} C_{S,h_0,h_1}(v) = \frac{1}{2m}\sum_{i=1}^{2m} iX_i. \qquad (1)$$

Our goal is to show that with overwhelming probability over the choice of $h_0$ and $h_1$, the right hand side
of Equation (1) is close to $\mu$. We will do this by showing that for small enough i, $X_i$ is tightly concentrated
around $\mu_i$, and that larger i do not contribute significantly to the sum.
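The quantity in Equation (1) is easy to measure empirically. The following sketch samples one random cuckoo graph with fully random hash values and returns the average, over all 2m vertices, of the number of edges in the vertex's component. It is only an illustration: the function name, the seeded generator, and the parameter values in the usage note are our assumptions, and a single sample proves nothing about concentration.

```python
import random
from collections import Counter

def average_component_edges(n, m, seed=0):
    """Average over all 2m vertices v of C(v) for one random cuckoo graph.

    n keys are hashed to uniformly random cells; left cells are numbered
    0..m-1 and right cells m..2m-1 (a convention of this sketch).
    """
    rng = random.Random(seed)
    edges = [(rng.randrange(m), m + rng.randrange(m)) for _ in range(n)]
    parent = list(range(2 * m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    edges_in_root = Counter(find(a) for a, _ in edges)
    total = sum(edges_in_root[find(v)] for v in range(2 * m))
    return total / (2 * m)
```

For instance, `average_component_edges(1000, 2000)` corresponds to a table with n = 1000 keys and m = 2n cells per side; the returned average stays a small constant, in line with $\mu = O(1)$.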

Lemma 15: The following properties both hold.

1. Suppose $\mu_i \ge n^{2/3}$. Then $\Pr(|X_i - \mu_i| \ge n^{2/3}) \le 2e^{-\tilde{\Omega}(n^{1/3})}$. (The $\tilde{\Omega}$ notation hides factors
polylogarithmic in n.)

2. Let $i^*$ be the smallest value of i such that $\mu_i < n^{2/3}$, and let $X_*$ be the number of vertices in
components of size at least $i^*$. Then $\Pr[X_* \ge \gamma n^{2/3}] \le 1/n^{\Omega(\log n)}$ for some constant $\gamma$.

Proof:

1. This follows from a standard application of Azuma's inequality, applied to the edge exposure
martingale that reveals the edges of the cuckoo graph one at a time. More specifically, we reveal the edges
of G one at a time in an arbitrary order, say $e_1, \ldots, e_n$, and let $Z_j = E[X_i \mid e_1, \ldots, e_{j-1}]$. Then $Z_j$ is
a martingale, and changing a single edge can only change the number of components of size i by a
constant (specifically, two), and hence $|Z_j - Z_{j-1}| \le 2i$ for all j. Thus, by Azuma's inequality

$$\Pr(|X_i - E[X_i]| \ge 2i\lambda\sqrt{n}) = \Pr(|Z_n - Z_0| \ge 2i\lambda\sqrt{n}) \le 2e^{-\lambda^2/2}.$$

Setting $\lambda = \frac{n^{2/3}}{2i\sqrt{n}} = \Omega(n^{1/6}/i)$, we see

$$\Pr(|X_i - E[X_i]| \ge n^{2/3}) \le e^{-\Omega(n^{1/3}/i^2)}.$$

By a standard calculation, $E[X_i] \ge n^{2/3}$ implies $i \le c_1 \log n$ for some constant $c_1$, and the theorem
follows.

2. Since $\Pr[C_{S,h_0,h_1}(v) \ge k] \le \beta^k$ for some constant $\beta \in (0, 1)$, it follows easily that
$E[X_*] \le \sum_{i \ge i^*} n\beta^i$, where $n\beta^{i^*} \le n^{2/3}$. It follows that $E[X_*] = O(n^{2/3})$.

In order to get concentration of $X_*$ about its mean, we use a slight modification of the edge exposure
martingale, which will essentially allow us to assume that all connected components have size
$O(\log^2 n)$ when attempting to bound the differences between martingale steps, which happens with
very high probability by Lemma 11. This technique is formalized for example in Theorem 3.7 of [32].
Let Q be the event that all connected components are of size at most $\log^2 n$. We reveal the edges
of G one at a time in an arbitrary order, say $e_1, \ldots, e_n$, and let $Z_j = E[X_* \mid e_1, \ldots, e_{j-1}]$ if the edges
$e_1, \ldots, e_{j-1}$ do not include a component of size greater than $\log^2 n$, and $Z_j = Z_{j-1}$ if the edges
$e_1, \ldots, e_{j-1}$ do include a component of size greater than $\log^2 n$. Then $Z_j$ is a martingale, and since
changing a single edge can only change the number of components of size i by at most two, we see
that $|Z_j - Z_{j-1}| \le 2c\log^2 n$. Now as $Z_n$ will equal $X_*$ except in the case where event Q does not
hold, we can apply Azuma's inequality to the above martingale to conclude that

$$\Pr(|X_* - E[X_*]| \ge 2c\lambda\sqrt{n}\log^2 n) \le 2e^{-\lambda^2/2} + \Pr(\neg Q).$$

Setting $\lambda = \frac{n^{2/3}}{2c\log^2(n)\sqrt{n}}$, we obtain

$$\Pr(|X_* - E[X_*]| \ge n^{2/3}) \le e^{-\tilde{\Omega}(n^{1/3})} + \frac{1}{n^{\Omega(\log n)}}.$$

Noting that $E[X_*] = O(n^{2/3})$, we conclude

$$\Pr(X_* \ge \gamma n^{2/3}) \le \frac{1}{n^{\Omega(\log n)}}$$

for some constant $\gamma$.



Properties 1 and 2 of Lemma 15 together imply that with very high probability over choice of $h_0$ and
$h_1$,

$$\sum_{i=1}^{2m} iX_i = \sum_{i=1}^{i^*-1} i(\mu_i \pm n^{2/3}) + O(n^{2/3}) \le \sum_{i=1}^{2m} i\mu_i + O(n^{2/3}\log^2 n).$$

Combining the above with Equation (1), with overwhelming probability over choice of $h_0$ and $h_1$ it holds that

$$E_{v \in V}[C_{S,h_0,h_1}(v)] = \frac{1}{2m}\sum_{i=1}^{2m} iX_i \le \mu + o(1).$$

Thus, we have shown that with very high probability over the choice of G, there is a constant $c_3$ such that
$E_{v \in V}[C_{S,h_0,h_1}(v)] \le c_3$.



Our last step in proving Lemma 14 is to show that if we choose a set of vertices at random, the sum of
the component sizes is concentrated around its mean. Indeed, by applying a similar argument as in the proof
of Lemma 15, Property Two, we can assume for all $v \in V$, $C_{S,h_0,h_1}(v) \le c_2\log^2 n$, as long as we add an
additional term of $1/n^{\Omega(\log n)}$ to the bound on the probability obtained using Azuma's inequality. This yields

$$\Pr\left(\left|\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) - c_3 N\right| \ge \lambda\sqrt{N}c_2\log^2 n\right) \le 2e^{-\lambda^2/2} + \frac{1}{n^{\Omega(\log n)}}.$$

Setting $\lambda = \sqrt{N}/\log^2 n$ yields

$$\Pr\left(\left|\frac{1}{N}\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) - c_3\right| \ge c_2\right) \le e^{-\Omega(N/\log^4 n)} + \frac{1}{n^{\Omega(\log n)}}.$$

The conclusion follows, with $c = c_3 + c_2$. ■

Notice that for any set of distinct items $\{x_1, \ldots, x_N\}$ to be inserted, the sets $\{h_0(x_1), \ldots, h_0(x_N)\}$ and
$\{h_1(x_1), \ldots, h_1(x_N)\}$ are uniformly distributed sets of vertices in $G(S, h_0, h_1)$. Thus, Lemma 14 ensures
that for distinct items $\{x_1, \ldots, x_N\}$ with $N \ge \log^6 n$, and for any set S of size n, with probability at least
$1 - 1/n^{\Omega(\log n)}$ over the choice of $h_0$ and $h_1$, $\sum_{i=1}^{N} C_{S,h_0,h_1}(x_i) \le cN$.

With this in hand, we are ready to show that with overwhelming probability OuterQueue does not exceed
size $\log^6 n$ over any sequence of poly(n) operations.

Lemma 16: With probability $1 - n^{-\Omega(\log n)}$, the queue of the primary structure has size at most $\log^6 n$ at
all times.

Proof: We define a good event $\xi_4$ that ensures that OuterQueue has size at most $\log^6 n$. Namely, let $\xi_4$ be the
event that for all times $\log^6 n \le j \le p(n)$, it holds that $\sum_{i=j-\log^6 n}^{j} C_{\hat{S}_{j-\log^6 n},h_0,h_1}(x_i) \le c\log^6 n$. By Lemma
14 and a union bound over all $p(n)$ operations in $\pi$, $\xi_4$ occurs with probability at least $1 - 1/n^{\Omega(\log n)}$.

For $j \ge \log^6 n$, suppose by induction there are at most $\log^6 n$ items in the effective queue at all times
less than j (as a base case, this is clearly true for all $j \le \log^6 n$). In particular, this holds at time $j - \log^6 n$.
We can assume all items in the effective queue at time $j - \log^6 n$ are distinct because we process deletions
immediately. Since all the (at most) $\log^6 n$ items in the effective queue are distinct, event $\xi_4$ guarantees that
all of these items can be cleared from the queue in $c\log^6 n$ steps. Setting the number of steps expended on
elements of the queue every operation to $a = c$, all (at most) $\log^6 n$ items in the effective queue at time
$j - \log^6 n$ will be cleared from the queue before time j. Thus, at time j, the queue contains at most $\log^6 n$
items, and this completes the induction. ■
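The induction above treats the queue as a server that receives one job per operation and spends a fixed budget of moves per operation on the front of the queue. The following toy simulation, a sketch under our own simplifying assumptions (one enqueue per operation, job costs given up front, names `max_queue_length` and `moves_per_op` hypothetical), shows how the maximum queue length depends on that budget.

```python
from collections import deque

def max_queue_length(job_costs, moves_per_op):
    """Simulate a de-amortized queue.

    Each operation enqueues one job whose cost (in moves) is taken from
    job_costs, then spends up to moves_per_op moves on the jobs at the
    front of the queue. Returns the largest queue length observed.
    """
    queue = deque()
    longest = 0
    for cost in job_costs:
        queue.append(cost)
        longest = max(longest, len(queue))
        budget = moves_per_op
        while queue and budget > 0:
            spent = min(budget, queue[0])
            queue[0] -= spent
            budget -= spent
            if queue[0] == 0:
                queue.popleft()  # the job is finished
    return longest
```

When the total cost of any window of jobs is at most the budget available during that window (the role played by event $\xi_4$), the queue stays short; when the budget is too small, as in `max_queue_length([5, 5, 5, 5], 1)`, the queue grows with every operation.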

Q only contains items from OuterStash and OuterQueue, and we have shown both deques contain at most $\log^6 n$
items with overwhelming probability. Theorem 2 follows.

3.4 Putting it All Together 

For any constant e > 0, our nested cuckoo construction uses (2 + e)n words for the outer cuckoo table, 
0{m'^/^) for the inner structure Q, and (with overwhelming probability) 0(log^ m) words for the cycle- 
detection mechanisms. The latter two space costs are dominated by the first, and so our total space usage is 
(2 + e)n words in total for constant e > 0. 

We derive our final theoretical guarantees on the running time of each operation for both of our con- 
structions. 

Construction One: Inserts take O(1) time by design. Lookups and deletions require examining $T_0[h_0(k)]$,
$T_1[h_1(k)]$, $R_0[f_0(k)]$, and $R_1[f_1(k)]$, and performing a lookup in L, which in Construction 1 potentially
requires examining all elements in L. Theorems [I] and |2] together imply that for < s < log^ n, with 
probability at least 1 — l/m^^^\ L contains only s items. Thus, lookups and removals take time s with 
probability at least 1 — l/m^^^\ even though s is not a tuning parameter of our construction. 

To clarify, the hidden constant in the exponent is fixed, i.e. independent of all parameters. Thus, for 
any constant c, there is some larger constant s such that the probability lookups and removals take time 
more than s is bounded by $1/m^c$. We also remark that it is straightforward to extend our analysis to all
s < m}'^'^, but for clarity we have not presented our results in this generality. 

Construction Two: Again, inserts take O(1) time by design. As in Construction One, lookups and
deletions require examining $T_0[h_0(k)]$, $T_1[h_1(k)]$, $R_0[f_0(k)]$, and $R_1[f_1(k)]$, and performing a lookup in
L; assuming L has size s = O(log^'^n), such a lookup can be performed in constant time using our 
atomic stash. Theorems 1 and 2 therefore imply that lookups and removes take O(1) time with probability

1_ l/^!^(logi/*n)_ 

3.5 Extensions 

3.5.1 polylog(n)-wise Independent Hash Functions

We remark that an argument of Arbitman et al. [5] implies almost without modification that in Construction
Two, insertions, deletions, and lookups take O(1) time with overwhelming probability even if the hash
functions $h_0$, $h_1$, $f_0$, and $f_1$ are chosen from polylog(n)-wise independent hash families. For completeness,
we reproduce this argument in our context.

In the analysis above, the only places we used the independence of our hash functions were in Lemmata
7, 9, 12, and 14 above. These lemmata allowed us to define four events that occur with high or overwhelming
probability, whose occurrence guarantees that our time bounds hold. Specifically, for any fixed s, our time
bounds for Construction Two hold if none of the following "bad" events occur:

1. Event 1: There exists a set $\{v_1, \ldots, v_N\}$ of $N = O(\log^6 n)$ vertices in the outer cuckoo graph, such
that $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) > cN$ (this is the complement of event $\xi_4$ from Section 3.3.2).



2. Event 2: There exists a set of at most 2m vertices in the outer cuckoo graph, such that the number
of stashed elements from the set exceeds $O(\log^6 n)$ (this is the complement of event $\xi_3$ from Section
3.3.1).

3. Event 3: There exists a set of at least 0(log^'^ n) vertices in the inner cuckoo graph, all of whose 
connected components at some point in time have size greater than 8 (this is the complement of the
event $\xi_2$ defined in Section 3.2.2).



4. Event 4: There exists a set of vertices in the cuckoo graph, such that the number of stashed elements 
from the set at some point in time exceeds 0(log^' ^ n) (this is the complement of the event ^1 defined 



in Section 3.2.1).



Lemmata 7, 9, 12, and 14 ensure that if $h_0$, $h_1$, $f_0$, and $f_1$ are fully random, none of the four events
occur with probability at least 1 — l/n^('°s ^""^h In order to show that the conclusion holds even if the 
hash functions are polylog(n)-wise independent, we apply a recent result of Braverman [8] stating that
polylogarithmic independence fools constant-depth boolean circuits.

Theorem 17: ([8]) Let $s \ge \log m$ be any parameter. Let F be a boolean function computed by a circuit of
depth d and size m. Let $\mu$ be an r-independent distribution where

$$r \ge 3 \cdot 60^{d+3} \cdot (\log m)^{(d+1)(d+3)} \cdot s^{d(d+3)},$$

then $|E_\mu[F] - E[F]| < 0.82^s \cdot 15m$.
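To get a feel for the parameters in Theorem 17, the sketch below evaluates the independence threshold r and the error bound for given circuit depth d, circuit size m, and parameter s, reading the bound as $r \ge 3 \cdot 60^{d+3} (\log m)^{(d+1)(d+3)} s^{d(d+3)}$ with error $0.82^s \cdot 15m$. The function names are ours, and the use of base-2 logarithms is an assumption; the theorem statement above does not fix the base.

```python
import math

def required_independence(depth, size, s):
    """Independence r sufficient in Theorem 17 (base-2 log assumed)."""
    log_size = math.log2(size)
    if s < log_size:
        raise ValueError("Theorem 17 requires s >= log(size)")
    return (3 * 60 ** (depth + 3)
            * log_size ** ((depth + 1) * (depth + 3))
            * s ** (depth * (depth + 3)))

def error_bound(size, s):
    """Upper bound on |E_mu[F] - E[F]| from Theorem 17."""
    return 0.82 ** s * 15 * size
```

For constant depth and quasi-polynomial circuit size, $\log m$ is polylog(n), so choosing s = polylog(n) keeps r polylogarithmic while the error term $0.82^s \cdot 15m$ becomes negligible once s exceeds a large enough multiple of $\log m$.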

Theorem 17 implies that if we can develop constant-depth boolean circuits of quasi-polynomial size that
recognize Events 1-4 above, then the probability any of the events occur under polylogarithmic independent
hash functions will be very close to the probability the events occur under fully random hash functions. The
circuits that recognize our events are similar to those used in Arbitman et al. [5]; the input wires to the first
two circuits contain the values $h_0(x_1), h_1(x_1), \ldots, h_0(x_n), h_1(x_n)$ (where the $x_i$'s represent the elements
inserted into the outer cuckoo table), while the input wires to the second two circuits contain the values
$f_0(x_1), f_1(x_1), \ldots, f_0(x_j), f_1(x_j)$, where j is the number of items in the inner cuckoo table.

1. Identifying Event 1: Just as in [5], this event occurs if and only if the graph contains at least one forest
from a specific set of forests of the bipartite graph on $[m] \times [m]$, where $m = (1 + \epsilon)n$. We denote
this set of forests by $\mathcal{F}_n$, and observe that $\mathcal{F}_n$ is a subset of all forests with at most $cN = O(\log^6(n))$
vertices, which implies that $|\mathcal{F}_n| = n^{\mathrm{polylog}(n)}$. Therefore, the event can be identified by a constant-depth
circuit of size $n^{\mathrm{polylog}(n)}$ that simply enumerates all forests in $\mathcal{F}_n$, and for every such forest
checks whether it exists in the graph.

2. Identifying Event 2: A constant-depth circuit identifying this event enumerates over all
$S \subseteq [m] \times [m]$ of size log^'^(n) and checks whether all elements of $S$ are stashed. As in [5], a minor
complication is that we must define a canonical set of stashed elements for each set $S$; this is only for
simplifying the analysis, and does not require modifying our actual construction. Our circuit checks
whether all elements of $S$ are stashed by, for each $x \in S$, enumerating over all connected components
in which edge $(h_0(x), h_1(x))$ is stashed according to the canonical set of stashed items for $S$ and
checking if the component exists in the graph. We may assume Event 1 does not occur, and thus we
need only iterate over components of O(log^(n)) vertices. The circuit thus has $O(n^{\mathrm{polylog}(n)})$ size.

3. Identifying Events 3 and 4: We may assume Events 1 and 2 do not occur. Then there are at most
O(log^ n) edges in the inner cuckoo table, so a constant-depth circuit of quasi-polynomial size simply
enumerates over all possible edge sets $E'$ of size O(log^ n) satisfying Event 3 or Event 4 and checks
if $E'$ equals the input to the circuit.



Thus, we can set $s = \mathrm{polylog}(n)$ and $m = n^{\mathrm{polylog}(n)}$ in the statement of Theorem 17 to conclude
that, even if we use hash functions from a polylog(n)-wise independent family of functions, Events 1-4
still only occur with negligible probability. We remark that similar arguments demonstrate that when
polylog(n)-wise independent hash functions are used, all operations under Construction 1 still take $O(s)$
time with probability 1 - 1/m^^^^ when $s = \mathrm{polylog}(n)$ (the amount of independence required depends
on $s$).
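
As a rough illustration of how the parameters in Theorem 17 interact, suppose (purely as an assumption for
this sketch) that the recognizing circuits have size $m = n^{\log^c n}$ for a hypothetical constant $c \ge 1$,
that logarithms are base 2, and that we choose $s = \log^{c+2} n$. Then, for constant depth $d$:

```latex
\begin{align*}
\log m &= \log^{c+1} n \le s, \\
r &\ge 3 \cdot 60^{d+3} \cdot (\log n)^{(c+1)(d+1)(d+3)} \cdot (\log n)^{(c+2)d(d+3)} = \mathrm{polylog}(n), \\
0.82^{s} \cdot 15m &= 15 \cdot 2^{\log^{c+1} n - s \log_2(1/0.82)}
  \approx 15 \cdot 2^{\log^{c+1} n - 0.286 \log^{c+2} n}.
\end{align*}
```

The last expression is $n^{-\omega(1)}$, since the exponent is dominated by $-0.286 \log^{c+2} n$ for large $n$,
so under these assumptions the two probabilities in the theorem differ by a negligible amount while $r$ stays
polylogarithmic.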

3.5.2 Sufficiently Independent Hash Functions Evaluated in O(1) time

Unfortunately, all known constructions of polylog(n)-wise independent hash families that can be evaluated
in the RAM model while maintaining our $O(n)$ space bound come with important caveats. The classic
construction of $k$-wise independent hash functions due to Carter and Wegman, based on degree-$k$ polynomials
over finite fields, requires time $O(k)$ to evaluate; ideally we would like $O(1)$ evaluation time to maintain our
time bounds. The work of Siegel [39] is particularly relevant; for polynomial-sized universes, he proves the
existence of a family of $n^{\epsilon}$-wise independent hash functions for $\epsilon > 0$ (which is super-logarithmic), which
can be evaluated in $O(1)$ time using lookup tables of size $n^{\delta}$ for some $\delta < 1$. However, his construction
is non-uniform in that he relies on the existence of certain expanders for which we do not possess explicit
constructions. Subsequent works that improve and/or simplify [39] (e.g., [18, 19, 33]) all possess polynomial
probabilities of failure, which render them unsuitable when seeking guarantees that hold with overwhelming
probability. The development of uniformly computable hash families which can be evaluated in $O(1)$ time
using $o(n)$ words of memory remains an important open question.
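
To make the $O(k)$ evaluation cost of the polynomial construction concrete, here is a minimal Python sketch.
It assumes integer keys smaller than a fixed Mersenne prime, and the names `P`, `poly_eval`, and
`make_kwise_hash` are hypothetical helpers for this illustration, not part of our construction. The final
reduction modulo the table size is only approximately uniform.

```python
import random

P = (1 << 61) - 1  # assumed field size: a Mersenne prime larger than every key


def poly_eval(coeffs, x):
    """Evaluate a polynomial over GF(P) with Horner's rule: one multiply-add per coefficient."""
    acc = 0
    for c in coeffs:
        acc = (acc * x + c) % P
    return acc


def make_kwise_hash(k, table_size, rng=random):
    """Draw k random coefficients; each evaluation of the returned function costs O(k)."""
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x):
        return poly_eval(coeffs, x) % table_size

    return h
```

Each call to `h` performs exactly `k` multiply-add steps, which is the evaluation cost that an $O(1)$-time
family would need to avoid.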

3.5.3 Achieving Loads Close to One 

We remark that we can achieve $O(1)$ worst-case operations with overwhelming probability using $(1 + \epsilon)n$
words of memory for any constant $\epsilon > 0$ by substituting our fully de-amortized nested cuckoo hash tables in
for the "backyard" cuckoo table of Arbitman et al. [6]. The construction of [6] uses a main table consisting of
$(1 + \epsilon/2)n/d$ buckets, each of size $d$ for some constant $d$, and uses a de-amortized cuckoo table as a "backyard"
to handle elements from overflowing buckets. They show that for constant $\epsilon > 0$, with overwhelming
probability the backyard cuckoo table must only store a small constant fraction of the elements. Note that
$n^{\alpha}$-wise independent hash functions for some $\alpha > 0$ are required to map items to buckets in the main
table, and therefore the technique cuts our space usage by a factor of about 2, but increases the amount of
independence we need to assume in our hash functions for theoretical guarantees to hold.
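
The bucket-plus-backyard layout can be sketched as follows. This is only an illustration under stated
assumptions: the main table is taken to have about $(1 + \epsilon/2)n/d$ buckets, a Python dict stands in for
the de-amortized backyard table, Python's built-in `hash` stands in for the highly independent hash
function, and the class name `BucketedTable` is hypothetical.

```python
class BucketedTable:
    def __init__(self, n, d=4, eps=0.5, hash_fn=hash):
        self.d = d
        num_buckets = max(1, int((1 + eps / 2) * n / d))  # assumed bucket count
        self.buckets = [[] for _ in range(num_buckets)]
        self.backyard = {}        # stand-in for the nested cuckoo "backyard"
        self.hash_fn = hash_fn

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:               # overwrite in place
                bucket[i] = (key, value)
                return
        if key in self.backyard or len(bucket) >= self.d:
            self.backyard[key] = value        # overflow goes to the backyard
        else:
            bucket.append((key, value))

    def lookup(self, key):
        for existing, value in self._bucket(key):  # at most d probes
            if existing == key:
                return value
        return self.backyard.get(key)
```

A lookup touches one bucket of constant size $d$ and then, only if needed, the backyard.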

4 Cache-Oblivious Multimaps 

In this section, we describe our cache-oblivious implementation of the multimap ADT. To illustrate the 
issues that arise in the construction, we first give a simple implementation for a RAM, and then give 
an improved (cache-oblivious) construction for the external memory model. Specifically, we describe an 
amortized cache-oblivious solution and then we describe how to de-amortize this solution. 

In the implementation for the RAM model, we maintain two nested cuckoo hash tables, as described
in Section 2. The first table enables fast containsItem(k, v) operations; this table stores all the $(k, v)$ pairs
using each entire key-value pair as the key, and the value associated with $(k, v)$ is a pointer to $v$'s entry in
a linked list $L(k)$ containing all values associated with $k$ in the multimap. The second table ensures fast
containsKey(k), getAll(k), and removeAll(k) operations: this table stores all the unique keys $k$, as well as
a pointer to the head of $L(k)$.

Operations in the RAM implementation. 

1. containsKey(k): We perform a lookup for $k$ in Table 2.

2. containsItem(k, v): We perform a lookup for $(k, v)$ in Table 1.

3. add(k, v): We add $(k, v)$ to Table 1 using the insertion procedure for nested cuckoo hash tables. We
perform a lookup for $k$ in Table 2, and if $k$ is not found we add $k$ to Table 2. We then insert $v$ as the head
of the linked list $L(k)$ referenced from Table 2.

4. remove(k, v): We remove $(k, v)$ from Table 1, and remove $v$ from the linked list $L(k)$; if $v$ was the
head of $L(k)$, we also perform a lookup for $k$ in Table 2 and update the pointer for $k$ to point to the
new head of $L(k)$ (if $L(k)$ is now empty, we remove $k$ from Table 2).

5. getAll(k): We perform a lookup for $k$ in Table 2 and return the pointer to the head of $L(k)$.

6. removeAll(k): We remove $k$ from Table 2. In order to achieve unamortized $O(1)$ I/O complexity,
we do not update the corresponding pointers of $(k, v)$ pairs in Table 1; this creates the presence of
"spurious" pointers in Table 1, but Angelino et al. [3] explain how to handle the presence of such
spurious pointers while increasing the cost of all other operations by $O(1)$ factors.
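
The two-table layout above can be sketched in Python as follows. This is a simplified illustration, not our
construction: Python dicts stand in for the nested cuckoo tables, the class and method names are
hypothetical, `add` rejects duplicate pairs, and `remove_all` walks the list eagerly (costing time
proportional to the list length) instead of leaving spurious pointers behind.

```python
class Node:
    __slots__ = ("value", "prev", "next")

    def __init__(self, value):
        self.value, self.prev, self.next = value, None, None


class ListMultimap:
    def __init__(self):
        self.table1 = {}   # (k, v) -> node of v inside L(k)
        self.table2 = {}   # k -> head node of L(k)

    def contains_key(self, k):
        return k in self.table2

    def contains_item(self, k, v):
        return (k, v) in self.table1

    def add(self, k, v):
        if (k, v) in self.table1:
            return
        node = Node(v)
        head = self.table2.get(k)
        node.next = head
        if head is not None:
            head.prev = node
        self.table2[k] = node          # v becomes the new head of L(k)
        self.table1[(k, v)] = node

    def remove(self, k, v):
        node = self.table1.pop((k, v), None)
        if node is None:
            return
        if node.next is not None:
            node.next.prev = node.prev
        if node.prev is not None:
            node.prev.next = node.next
        elif node.next is not None:
            self.table2[k] = node.next  # v was the head: repoint Table 2
        else:
            del self.table2[k]          # L(k) is now empty

    def get_all(self, k):
        node = self.table2.get(k)
        while node is not None:         # pointer chasing: poor locality
            yield node.value
            node = node.next

    def remove_all(self, k):
        node = self.table2.pop(k, None)
        while node is not None:         # eager cleanup, unlike the lazy scheme above
            del self.table1[(k, node.value)]
            node = node.next
```

The traversal in `get_all` follows one pointer per value, which is the locality problem that motivates the
array-based layout.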

All operations above are performed in $O(1)$ time in the worst case with overwhelming probability by
the results of Section 2. Two major issues arise in the above construction. First, the space usage remains
$O(n)$ only if we assume the existence of a garbage collector for leaked memory, as well as a memory
allocation mechanism, both of which must run in $O(1)$ time in the worst case. Without the memory
allocation mechanism, inserting $v$ into $L(k)$ cannot be done in $O(1)$ time, and without the garbage collector
for leaked memory, space cannot be reused after remove and removeAll operations. Second, in order to
extract the actual values from a getAll(k) operation, one must actually traverse the list $L(k)$. Since $L(k)$
may be spread all over memory, this suffers from poor locality.

We now present our cache-oblivious multimap implementation. Our implementation avoids the need for
garbage collection, and circumvents the poor locality of the above getAll operation. We do require a
cache-oblivious mechanism to allocate and deallocate power-of-two sized memory blocks with constant-factor
space and I/O overhead; this assumption is theoretically justified by the results of Brodal et al. [11].

Amortized Cache-Oblivious Multimaps. As in the RAM implementation, we keep two nested cuckoo
tables. In Table 1, we store all the $(k, v)$ pairs using each entire key-value pair as the key. With each such
pair, we store a count, which identifies an ordinal number for this value $v$ associated with this key, $k$, starting
from 0. For example, if the pairs were (4, Alice), (4, Bob), and (4, Eve), then (4, Alice) might be pair 0, (4,
Bob) pair 1, and (4, Eve) pair 2, all for the key, 4.

In Table 2, we store all the unique keys. For each key, $k$, we store a pointer to an array, $A_k$, that stores
all the key-value pairs having key $k$, stored in order by their ordinal values from Table 1. With the record
for a key $k$, we also store $n_k$, the number of pairs having the key $k$, i.e., the number of key-value pairs in
$A_k$. We assume that each $A_k$ is maintained as an array that supports amortized $O(1)$-time element access
and addition, while maintaining its size to be $O(n_k)$.

Operations. 

1. containsKey(k): We perform a lookup for $k$ in Table 2.

2. containsItem(k, v): We perform a lookup for $(k, v)$ in Table 1.

3. add(k, v): After ensuring that $(k, v)$ is not already in the multimap by looking it up in Table 1, we
look up $k$ in Table 2, and add $(k, v)$ at index $n_k$ of the array $A_k$, if $k$ is present in this table. If there
is no key $k$ in Table 2, then we allocate an array, $A_k$, of initial constant size. Then we add $(k, v)$ to
$A_k[0]$ and add key $k$ to Table 2. In either case, we then add $(k, v)$ to Table 1, giving it ordinal $n_k$, and
increment the value of $n_k$ associated with $k$ in Table 2. This operation may additionally require the
growth of $A_k$ by a factor of two, which would then necessitate copying all elements to the new array
location and updating the pointer for $k$ in Table 2.

4. remove(k, v): We look up $(k, v)$ in Table 1 and get its ordinal count, $i$. Then we remove $(k, v)$ from
Table 1, and we look up $k$ in Table 2, to learn the value of $n_k$ and get a pointer to $A_k$. If $n_k > 1$, we
swap $(k', v') = A_k[n_k - 1]$ and $(k, v) = A_k[i]$, and then remove the last element of $A_k$. We update
the ordinal value of $(k', v')$ in Table 1 to now be $i$. We then decrement the value of $n_k$ associated with
$k$ in Table 2. If this results in $n_k = 0$, we remove $k$ from Table 2. This operation may additionally
require the shrinkage of the array $A_k$ by a factor of 2, so as to maintain the $O(n)$ space bound.

5. getAll(k): We look up $k$ in Table 2, and then list the contents of the $n_k$ elements stored at the array
$A_k$ indexed from this record.

6. removeAll(k): For all entries $(k, v)$ of $A_k$, we remove $(k, v)$ from Table 1. We also remove $k$ from
Table 2 and deallocate the space used for $A_k$. As in the RAM implementation, in order to achieve
unamortized $O(1)$ I/O cost, we do not update the pointers of $(k, v)$ pairs in Table 1; this creates the
presence of "spurious" pointers in Table 1 which are handled the same as in the RAM case.
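
The ordinal bookkeeping and the swap-with-last removal can be sketched as follows. This is a minimal
illustration with assumed names (`ArrayMultimap`, `table1`, `table2`): Python dicts stand in for the two
nested cuckoo tables, a Python list plays the role of $A_k$ (its length is $n_k$, and it already grows with
amortized $O(1)$ appends), and `remove_all` deletes Table 1 entries eagerly.

```python
class ArrayMultimap:
    def __init__(self):
        self.table1 = {}   # (k, v) -> ordinal of the pair inside A_k
        self.table2 = {}   # k -> A_k, a contiguous list of (k, v) pairs

    def contains_key(self, k):
        return k in self.table2

    def contains_item(self, k, v):
        return (k, v) in self.table1

    def add(self, k, v):
        if (k, v) in self.table1:
            return
        arr = self.table2.setdefault(k, [])   # allocate A_k on first use
        self.table1[(k, v)] = len(arr)        # ordinal n_k
        arr.append((k, v))

    def remove(self, k, v):
        i = self.table1.pop((k, v), None)
        if i is None:
            return
        arr = self.table2[k]
        last = arr.pop()                      # drop the final slot of A_k
        if i < len(arr):                      # removed pair was not the last one
            arr[i] = last                     # move the former last pair into the hole
            self.table1[last] = i             # and record its new ordinal
        if not arr:
            del self.table2[k]

    def get_all(self, k):
        return [v for _, v in self.table2.get(k, [])]   # one contiguous scan

    def remove_all(self, k):
        for pair in self.table2.pop(k, []):
            del self.table1[pair]
```

Because `get_all` scans one contiguous array, it reads its $n_k$ entries sequentially rather than chasing
pointers.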

In terms of I/O performance, containsKey(k) and containsItem(k, v) clearly require $O(1)$ I/Os in the
worst case. getAll(k) operations use $O(1 + n_k/B)$ I/Os in the worst case, because scanning an array of
size $n_k$ uses $O(\lceil n_k/B \rceil)$ I/Os, even though we don't know the value of $B$. removeAll(k) utilizes $O(n_k)$
I/Os in the worst case with overwhelming probability, but these can be charged to the insertions of the $n_k$
values associated with $k$, for $O(1)$ amortized I/O cost. add(k, v) and remove(k, v) operations also require
$O(1)$ amortized I/Os with overwhelming probability; the bound is amortized because there is a chance this
operation will require a growth or shrinkage of the array $A_k$, which may require moving all $(k, v)$ values
associated with $k$ and updating the corresponding pointers in Table 1.

In the next section, we explain how to de-amortize add(k, v) and remove(k, v) operations.

De-Amortizing the Key-Value Arrays. To de-amortize the array operations, we use a rebuilding technique,
which is standard in de-amortization methods (e.g., see [38]).

We consider the operations needed for insertions to an array; the methods for deletions are similar. The
main idea is that we allocate arrays whose sizes are powers of 2. Whenever an array, $A$, becomes half full,
we allocate an array, $A'$, of double the size and start copying the elements of $A$ into $A'$. In particular, we
maintain a crossover index, $i_A$, which indicates the place in $A$ up to which we have copied its contents into $A'$.
Each time we wish to access $A$ during this build phase, we copy two elements of $A$ into $A'$, picking up at position
$i_A$, and updating the two corresponding pointers in Table 1. Then we perform the access of $A$ as we
would otherwise, except that if we wish to access an index $i < i_A$, then we actually perform this access in $A'$.
Since we copy two elements of $A$ for every access, we are certain to complete the building of $A'$ prior to our
needing to allocate a new, even larger array, even if all these accesses are insertions. Thus, each access of
our array will now complete in worst-case $O(1)$ time with overwhelming probability. It immediately follows
that add(k, v) and remove(k, v) operations run in $O(1)$ worst-case time. All time bounds in Table 1 follow.
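
The rebuilding scheme for insertions can be sketched as follows. This is a simplified illustration with
assumed names (`IncrementalArray`, `crossover`, `max_copies`): it copies on appends only, omits the Table 1
pointer updates and deletions, and relies on Python list allocation, which is not $O(1)$, whereas the text
assumes a constant-overhead allocator.

```python
class IncrementalArray:
    """Append-only array that copies at most two elements per append."""

    def __init__(self, capacity=2):
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two, at least 2")
        self.old = [None] * capacity   # A
        self.new = None                # A', present only during a build phase
        self.size = 0
        self.crossover = 0             # i_A: indices below it already live in A'
        self.max_copies = 0            # instrumentation only

    def _array_for(self, i):
        if self.new is not None and i < self.crossover:
            return self.new
        return self.old

    def get(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self._array_for(i)[i]

    def append(self, x):
        if self.new is None and 2 * self.size >= len(self.old):
            # A is half full: start building A' of double the size.
            self.new = [None] * (2 * len(self.old))
            self.crossover = 0
        self.old[self.size] = x        # new items land in A and are copied later
        self.size += 1
        copies = 0
        if self.new is not None:
            while copies < 2 and self.crossover < self.size:
                self.new[self.crossover] = self.old[self.crossover]
                self.crossover += 1
                copies += 1
            if self.crossover == self.size:
                self.old, self.new = self.new, None   # build finished: A' replaces A
        self.max_copies = max(self.max_copies, copies)
```

Starting from a half-full array of capacity $c$, each append adds one element and copies two, so the copy
catches up exactly when the old array fills, after $c/2$ appends.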

5 Conclusion 

In this paper, we have studied fully de-amortized dictionary and multimap algorithms that support worst- 
case constant-time operations with high or overwhelming probability. At the core of our result is a "nested" 
cuckoo hash construction, in which an inner cuckoo table is used to support fast lookups into a queue/stash 
structure for an outer cuckoo table, as well as a simplified and improved implementation of an atomic stash, 
which is related to the atomic heap or q-heap data structure of Fredman and Willard [20]. We gave fully 
de-amortized constructions with guarantees that hold with high probability in the Practical RAM model, and 
with overwhelming probability in the external-memory (I/O) model, the standard RAM model, or the $AC^0$
RAM model.

Several interesting questions remain for future work. First, lookups in our structure may require four or 
more I/Os in external-memory; it would be interesting to develop fully de-amortized structures supporting 
lookups in as few as two I/Os. A prime possibility suited for external memory is random-walk cuckoo 
hashing with two hash functions and super-constant bucket sizes. Second, it would be interesting to develop
a fully de-amortized dictionary for the Practical RAM model where all operations take $O(1)$ time with
overwhelming probability.

A Our Atomic Stash Structure

In this section, we describe our deterministic atomic stash implementation of the dictionary ADT. This
structure dynamically maintains a sparse set, $S$, of at most $O(w^{1/2})$ key-value pairs, using $O(w)$ words
of memory, so as to support insertions, deletions, and lookups in $O(1)$ worst-case time, where $w$ is our
computer's word size. Our construction is valid in the I/O model and the $AC^0$ RAM model [40], that is,
the RAM model minus constant-time multiplication and division, but augmented with constant-time $AC^0$
instructions that are included in modern CPU instruction sets. (See previous work on the atomic heap [20]
data structure for a solution in the standard RAM model, albeit in a way that limits the size of $S$ to be
O(w^'^) and is less efficient in terms of constant factors.) We assume that keys and values can each fit in a
word of memory.

Our construction builds on the fusion tree and atomic heap implementations of previous researchers [2,
20, 40], but improves the simplicity and capacity of these previous implementations by taking advantage of
modern CPU instruction sets and the fact that we are interested here in maintaining only a simple dictionary,
rather than an ordered set. For instance, our solution avoids any lookups in pre-computed tables.

A.1 Components of an Atomic Stash

Our solution is based on associating with each key $k$ in $S$, a compressed key, $B_k$, of size $w' = \lceil w^{1/2} \rceil - 1$,
which we store as a representative of $k$. In addition, for each compressed key, $B_k$, we store a binary mask,
$M_k$, of equal length, such that

$B_k \wedge M_k \ne B_j \wedge M_k$,

for $j \ne k$, with $j$ in $S$, where "$\wedge$" denotes bit-wise AND. That is, all of the masked keys in $S$ are unique.

A critical element in our data structure is an associative cache, $X$, which is stored in a single word,
which we view as being divided into $w'$ fields of size $w'$ each, plus a (high-order) indicator bit for each field,
so $w \ge w'(w' + 1)$. We denote the field with index $i$ in $X$ as $X(i)$ and the indicator bit with index $i$ (for field
$i$) in $X$ as $X[i]$, and we assume that indices are increasing right-to-left and begin with index 1. Thus, the bit
position of $X[j]$ in $X$ is $j(w' + 1)$. Note that with standard SHIFT, OR, and AND operations, we can read
or write any field or indicator bit in $X$ in $O(1)$ time given its index and either a single bit or a bit mask, 1, of
$w'$ 1's. Each field $X(i)$ is either empty or it stores a compressed key, $B_k$, for some key $k$ in $S$. In addition,
we also maintain a word, $Y$, with indices corresponding to those in $X$, such that $Y(i)$ stores the mask, $M_k$,
if $X(i)$ stores the binary key $B_k$. We also maintain key and value arrays, $K$ and $V$, such that, if $B_k$ is stored
in $X(i)$, then we store the key-value pair $(k, v)$ in $K$ and $V$, so that $K[i] = k$ and $V[i] = v$. To keep track
of the size of $S$, we maintain a count, $n_S$, which is the number of items in $S$. Finally, we maintain a "used"
mask, $U$, which has all 1's in each field of $X$ that is used. That is, $U(i) = 1$ iff $X(i)$ holds some key, $B_k$,
which is equivalent to saying $Y(i) \ne 0$.

The reason we use both a compressed key and a mask for each key stored in $S$ is that, as we add keys to
$S$, we may need to sometimes expand the function that compresses keys so that the masked values of keys
remain unique, while still fitting in a field of $X$. Even in this environment, we would like for previously
compressed keys to still be valid. In particular, our method generates a sequence of compression functions,
$p_1$, $p_2$, etc., so that $p_i$ returns a $w'$-bit string whose first $i$ bits are significant (and can be either 0 or 1) and
whose remaining bits are all 0's. In addition, if we let $M^d$ denote a bit mask with $d$ significant 1 bits and
$w' - d$ following 0 bits, then, for $d \ge 2$,

$p_d(k) \wedge M^{d-1} = p_{d-1}(k)$,

for any key $k$.

A.2 Operations in an Atomic Stash 

Let us now describe how we perform the various update and query operations on an atomic stash. Assume 
we may use the following primitive operations in our methods: 

• A binary operator, "$\oplus$," which denotes the bit-wise XOR operation.

• DUPLICATE(B): return a word $Z$ having the binary key $B$ (of size $w'$) stored in each of its fields.
(Note: we can implement DUPLICATE either using a single multiplication or using $O(1)$ instructions
of a modern CPU.)

• VecEQ(W, B): given a word $W$ and a binary key $B$ (of size $w'$), set the indicator bit $W[i] = 1$ iff
$B = W(i)$. This operation can be implemented using standard $AC^0$ operations [2, 40].

• MSB(W): return the index of the most significant 1 bit in the word $W$, or 0 if $W = 0$. As Thorup
observed [40], this operation can be implemented, for example, by converting $W$ to floating point and
returning the exponent plus 1.

We perform a getIndex(k) operation, which returns the index of key $k$ in $X$, or 0 if $k$ is not a key in $S$,
as follows. We assume in this method that we have access to the current compression function, $p_d$.

• getIndex(k):

    $B \leftarrow p_d(k)$
    $Z \leftarrow$ DUPLICATE($B$)
    $T \leftarrow Y \wedge (X \oplus Z)$
    VecEQ($R$, 1)
    return MSB($R$)/($w' + 1$)

The correctness of this method follows from the fact that $B$ is the key in $S$ at index $i$ associated with $k$
iff $T(i)$ is all 0's and index $i$ is being used to store a key from $S$, since all masked keys in $S$ are unique (and
if $B$ is not in $X$, then MSB($R$) = 0). Also, if one desires that we should avoid integer division, then we can
define $w'$ so that $(w' + 1)$ is a power of 2, while keeping $w \ge w'(w' + 1)$, so that the above division can be done
as a SHIFT. In addition, note that we can implement a get(k) operation by performing a call to getIndex(k)
and, if the index, $i$, returned is greater than 0, then returning $(K[i], V[i])$.
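
The word-parallel lookup can be simulated with Python integers as follows. This is a sketch under several
assumptions. We take $w' = 7$ so that everything fits in a 64-bit word, and VecEQ is simulated with a loop
that returns only indicator bits. How $T$ is combined with the used mask $U$ is not spelled out in the
listing above, so the sketch assumes that VecEQ is applied to $T \oplus U$, which is all ones exactly in a
used field that matches. The helper names (`set_field`, `pack`, and so on) are hypothetical.

```python
WP = 7                       # w': field width in bits; 7 * 8 = 56 bits fit in a 64-bit word
STRIDE = WP + 1              # every field is followed by one indicator bit
FIELD_ONES = (1 << WP) - 1   # the all-ones field value
LOW_BITS = sum(1 << (i * STRIDE) for i in range(WP))   # lowest bit of every field


def shift_of(i):
    """0-based bit offset of field i (fields are numbered from 1, right to left)."""
    return (i - 1) * STRIDE


def set_field(word, i, value):
    cleared = word & ~(FIELD_ONES << shift_of(i))
    return cleared | ((value & FIELD_ONES) << shift_of(i))


def duplicate(b):
    """Copy the w'-bit key b into every field using one multiplication."""
    return b * LOW_BITS


def vec_eq(word, b):
    """Return only the indicator bits of the fields of `word` that equal b (loop simulation)."""
    out = 0
    for i in range(1, WP + 1):
        if ((word >> shift_of(i)) & FIELD_ONES) == b:
            out |= 1 << (shift_of(i) + WP)   # indicator bit sits just above field i
    return out


def msb(word):
    """1-based index of the most significant 1 bit, or 0 if word == 0."""
    return word.bit_length()


def get_index(X, Y, U, b):
    Z = duplicate(b)
    T = Y & (X ^ Z)                    # field i is zero iff b agrees with X(i) under mask Y(i)
    R = vec_eq(T ^ U, FIELD_ONES)      # assumption: used and matching fields become all ones
    return msb(R) // STRIDE


def pack(entries):
    """Build (X, Y, U) from (compressed_key, mask) pairs; for illustration only."""
    X = Y = U = 0
    for i, (key, mask) in enumerate(entries, start=1):
        X = set_field(X, i, key)
        Y = set_field(Y, i, mask)
        U = set_field(U, i, FIELD_ONES)
    return X, Y, U
```

For example, with `pack([(0b01, 0b01), (0b10, 0b11)])` the first entry was stored when only one bit was
significant, so a later query `0b11` still matches it through its one-bit mask.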

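To remove the key-value pair from $S$ associated with a key $k$ in $S$, we perform the following operation:

• remove(k):

    $i \leftarrow$ getIndex($k$)
    $X(i) \leftarrow X(n_S)$
    $Y(i) \leftarrow Y(n_S)$
    $U(n_S) \leftarrow 0$
    $K[i] \leftarrow K[n_S]$
    $V[i] \leftarrow V[n_S]$
    $n_S \leftarrow n_S - 1$
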
Our insertion method, which follows, assumes that we have access to the current compression function,
$p_d$, as well as the mask, $M^d$, of $d$ 1's followed by $(w' - d)$ 0's. We also assume we know the current value
of (the global variable) $d$ and that we have a function, P_EXPAND($k_1$, $k_2$), which takes two distinct keys, $k_1$
and $k_2$, and expands the compression function from $p_d$ to $p_{d+1}$ so that $p_{d+1}(k_1) \ne p_{d+1}(k_2)$. This method
also defines $M^{d+1}$ to consist of $(d + 1)$ significant 1 bits and $w' - (d + 1)$ trailing 0's, and it increments $d$.

• add(k, v):

    $i \leftarrow$ getIndex($k$)
    if $i > 0$ then {We had a collision.}
        P_EXPAND($k$, $K[i]$) {Also increments $d$}
        $X(i) \leftarrow p_d(K[i])$
        $Y(i) \leftarrow M^d$
    $n_S \leftarrow n_S + 1$
    $X(n_S) \leftarrow p_d(k)$
    $Y(n_S) \leftarrow M^d$
    $U(n_S) \leftarrow 1$
    $K[n_S] \leftarrow k$
    $V[n_S] \leftarrow v$

Since the compressed keys in $S$ form a set of unique masked keys, the new key, $k$, will collide with at
most one of them, even when compressed. So, it is sufficient in this case that we expand the compression
function so that it distinguishes $k$ and this one colliding field. Thus, by a simple inductive argument, we
maintain the property that all the masked keys in $S$ are unique.

A.3 Compressing Keys and De-Amortization 

Let us now describe how we compress keys and expand the compression function, as well as how we perform
the necessary de-amortization so that the size of compressed keys is never more than $w'$.

Our method makes use of the following primitive $AC^0$ operation [2, 40], which can also be computed
from included primitives in modern CPU instruction sets.

• SELECT(W, k): Given a word $W$ and key $k$, the fields of $W$ are viewed as bit pointers. A length-$w'$
bit string, $B$, is returned so that the $i$-th bit of $B$ equals the $W(i)$-th bit of $k$.

Our compression function is encoded in terms of the counter, $d$, and a word, $W$, that encodes the bits
to be selected from keys, so that $W(i)$ is the index in $k$ of the bit that becomes the $i$th bit of the output
of $p_d(k)$, for $i \le d$. For $i > d$, $W(i) = 0$. Thus, we define $p_d$ simply as follows.

• $p_d(k)$:

    return SELECT($W$, $k$)

Thus, we also have a simple definition for P_EXPAND:

• P_EXPAND($k_1$, $k_2$):

    $d \leftarrow d + 1$
    $W(d) \leftarrow$ MSB($k_1 \oplus k_2$)
    $M^d \leftarrow M^{d-1} \vee$ (1 SHIFT $d$)

Note that in the way we are using the expansion function, $p_d$, we only ever add bits to the ends of our
compressed keys. We never remove bits. This allows keys compressed under previous instantiations of the
compression function to continue to be used. But this also implies that, as we continue to add keys to $S$,
we might run out of fields to use, even if we keep the size, $n_S$, of $S$ to be at most, say, $w'/2$. Of course,
we could use a standard amortized rebuilding scheme to maintain $d$ to be at most $w'$, but this would require
using amortization instead of achieving true worst-case constant-time bounds for our updates and lookups.
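
The bit-selection compression and its expansion step can be sketched as follows. This is an illustration
under assumptions: a Python list of 0-based bit offsets stands in for the word $W$, a loop stands in for
SELECT, the $i$-th selected bit is placed at bit $i - 1$ of the output (the bit order within a field is our
choice here), and the limit $d \le w'$ is not enforced. The class name `Compressor` is hypothetical.

```python
class Compressor:
    """Compression function p_d together with the expansion step P_EXPAND."""

    def __init__(self):
        self.bit_positions = []   # stands in for W(1), ..., W(d)

    @property
    def d(self):
        return len(self.bit_positions)

    def compress(self, k):
        """p_d(k): gather the selected bits of k into a d-bit string."""
        out = 0
        for i, position in enumerate(self.bit_positions):
            out |= ((k >> position) & 1) << i
        return out

    def mask(self, d=None):
        """M^d: ones in the d significant positions, zeros elsewhere."""
        d = self.d if d is None else d
        return (1 << d) - 1

    def expand(self, k1, k2):
        """P_EXPAND: also select the highest bit on which k1 and k2 differ."""
        if k1 == k2:
            raise ValueError("keys must be distinct")
        self.bit_positions.append((k1 ^ k2).bit_length() - 1)
```

Since `expand` only appends a position, a key compressed under an earlier $p_{d-1}$ equals its current
compression masked by $M^{d-1}$, which is why older entries in $X$ stay valid.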

As a de-amortization technique, therefore, let us revise our construction, so that, whenever $d > w'/2$, we
create a new, initially empty, atomic stash. For each additional atomic-stash operation after this initialization,
we remove two items from the old atomic stash and add them to this new atomic stash. In performing
accesses and updates during this time, we also keep a crossover index, $x$, so that references to fields and
indicator indices less than or equal to $x$ are done in the new stash and references to fields and indicator
indices greater than $x$ are done in the old stash. Thus, after $w'/2$ additional operations, we can discard the
old stash (for which $d < w'$) and fully replace it with the new one (for which $d < w'/2$). Therefore, so long
as $n_S < w'/2$, we can perform all insertion, deletion, and lookup operations in worst-case $O(1)$ time, which
in the I/O model corresponds to $O(1)$ I/Os.